Abstract

We investigate the asymptotic optimality of a large class of multiple testing rules using the 
framework of Bayesian Decision Theory. We consider a parametric setup, in which observations 
come from a normal scale mixture model and assume that the total loss is the sum of losses for 
individual tests. Our model can be used for testing point null hypotheses of no signals (zero effects), 
as well as to distinguish large signals from a multitude of very small effects. The optimality of a rule is 
proved by showing that, within our chosen asymptotic framework, the ratio of its Bayes risk and that 
of the Bayes oracle (a rule which minimizes the Bayes risk) converges to one. Our main interest is in 
the asymptotic scheme under which the proportion p of "true" alternatives converges to zero. We 
fully characterize the class of fixed threshold multiple testing rules which are asymptotically optimal 
and hence derive conditions for the asymptotic optimality of rules controlling the Bayesian False 
Discovery Rate (BFDR). We also provide conditions under which the popular Benjamini-Hochberg 
and Bonferroni procedures are asymptotically optimal and show that for a wide class of sparsity 
levels, the threshold of the former can be approximated very well by a non-random threshold. 
Our results show that for optimal performance the BFDR (or FDR) controlling level should be 
chosen to be small if the expected signal magnitude or the relative cost of a type I error is large. We 
also show that for a wide range of sparsity levels (i.e. rates of convergence of $p$ to zero) and expected 
signal magnitudes, the Benjamini-Hochberg rule controlling the FDR at a fixed level $\alpha \in (0, 1)$ is 
asymptotically optimal, provided only that the ratio of losses for type I and type II errors converges 
to zero at a slow rate which can vary quite widely. When the loss ratio is constant, similar optimality 
results hold if the FDR controlling level slowly converges to zero as $p \to 0$. As far as we know, 
this is the first proof of the decision theoretic asymptotic optimality of the Benjamini-Hochberg rule 
in the context of hypothesis testing. 

1 Introduction 

Multiple testing has emerged as a very important problem in statistical inference, 
because of its applicability in understanding large data sets involving many param- 
eters. A prominent area of the application of multiple testing is microarray data 
analysis, where one wants to simultaneously test expression levels of thousands of 
genes (e.g. see [15], [H], [35], [18], [25], [26], [27] or [M]). Various ways of performing 
multiple tests have been proposed in the literature over the years, typically differing 
in their objective. Among the most popular classical multiple testing procedures, 
one could mention the Bonferroni correction, aimed at controlling the family wise 
error rate (FWER), and the Benjamini-Hochberg procedure [2], which controls the 
false discovery rate (FDR). A wide range of Empirical Bayes (e.g. see [13], [H], [15], 
[37] and [1]) and full Bayes tests (see e.g. [25], [8], [27] and [1]) have also been 
proposed and are used extensively in such problems. 

In recent years, substantial efforts have been made to understand the properties 
of multiple testing procedures under sparsity, i.e. in the case when the proportion $p$ 
of "true" alternatives among all tests is very small. We cite a few among many im- 
portant papers ([9], [To], [21], [20], [5]). A major theoretical breakthrough was made 
in [1], where, in a problem of estimating a sparse vector of means, a data-dependent 
thresholding estimator for the unknown means is proposed, the threshold being de- 
termined by applying the Benjamini-Hochberg procedure (henceforth denoted as 
BH). Specifically, in [1] it is shown that this estimator adapts very well to the un- 
known sparsity parameter p and is asymptotically minimax over a wide range of 
sparse parameter spaces and loss functions. 

In this paper we analyze the properties of multiple testing rules from the per- 
spective of Bayesian Decision Theory. We assume fixed losses $\delta_0$ and $\delta_A$ for type I 
and type II errors, respectively, for each test and define the overall loss of a multiple 
testing rule as the sum of the losses incurred in each individual test. We feel that 
such an approach is natural in the context of testing, where the main goal is to 
detect significant signals, rather than estimate their magnitude. In the specific case 
where $\delta_0 = \delta_A = 1$, the total loss is equal to the number of misclassified hypotheses. 
The main result of this paper is the proof of the asymptotic optimality properties of 
BH within this Bayesian perspective. BH is a very interesting procedure to analyze 
from this point of view, since, despite its frequentist origin, it shares some of the 
major strengths of Bayesian methods. Specifically, as shown in [T3] and [T7], BH 
can be understood as an Empirical Bayes approximation to the procedure control- 
ling the "Bayesian" False Discovery Rate (BFDR). This approximation relies mainly 
on estimation of the distribution generating the data by the empirical distribution 
function. In this way, similarly to standard Bayes methods, it gains strength by 
combining information from all the tests. The major issue addressed in this paper 
is the relationship between BFDR control and optimization of the Bayes risk. Our 
research was motivated mainly by the good properties of BH with respect to the 
misclassification rate under sparsity, documented in [T7], [3] and The present 
paper lends theoretical support to these experimental findings, by specifying a large 
range of loss functions for which BH is asymptotically optimal in a Bayesian Decision 
Theoretic context. 

We consider multiple testing where our observations are assumed to come from 
a normal scale mixture model (see (2.6) below). This model has been used earlier 
in the context of multiple testing (see, e.g., [27], [3] and [1]) and differs from the 
model used in [1] by imposing a normal prior distribution on the unknown vector of 
means. As discussed in Section 2, depending on the form of the assumed mixture 
distribution, this model can be used for testing point null hypotheses to decide if 
the unknown means are zero, as well as for identifying large signals embedded in a 
multitude of very small effects. In this situation, each individual test tries to decide 
which component of the mixture generated the corresponding data. For the rest 
of the paper we will use the generic term "signal" to refer to the unknown mean 
value under the alternative. Under an additive loss function, we first find the Bayes 
rule which minimizes the overall risk (the Bayes risk), and this rule is henceforth 
referred to as the Bayes oracle. The Bayes oracle turns out to be a rule which 
applies a fixed threshold critical region (of the form $X_i^2 \geq K$) for each individual test. 
This threshold and the properties of the Bayes oracle depend on three parameters: 
the sparsity level $p$, the ratio of losses $\delta = \delta_0/\delta_A$ and the average squared signal 
magnitude $u$, defined as the ratio between the variances of the non-null and null 
components of the mixture. In our asymptotic considerations, we mainly consider the 
scenario where $p$ goes to zero and $\log \delta = o(\log p)$. We observe that in this situation 
the Bayes oracle has an asymptotic power larger than zero if and only if $u$ increases to 
infinity, such that $\frac{u}{-\log p} \to C_u \in (0, \infty]$. We concentrate our attention on such 
detectable signals and classify a multiple testing rule as asymptotically optimal if, in 
this setting, the ratio of its Bayes risk to that of the Bayes oracle converges to one. We 
place special emphasis on the case where $C_u < \infty$. In this situation, the asymptotic 
power of the Bayes oracle is smaller than one and we classify the signals $u \propto -\log p$ 
as signals on "the verge of detectability". Specifically, if $p \propto m^{-\beta}$, where $m$ is the 
number of tests and $\beta \in (0, \infty)$, then signals on the verge of detectability satisfy 

$$\frac{u}{\log m} \to C_u \in (0, \infty). \qquad (1.1)$$


This result is quite natural, since, under sparsity, the magnitude of the largest 
test statistic corresponding to the null hypothesis is of the order $2 \log m$. Thus, 
signals increasing to infinity at a rate slower than $\log m$ cannot be distinguished 
from the largest components of the noise. In a frequentist setting, a similar scaling 
for asymptotically detectable signals was proposed in [9] (with u replaced by the 
square of a true mean). 

In the first part of this paper we study fixed threshold tests in great detail and 
fully characterize the class of asymptotically optimal fixed threshold testing rules. 
Using this, we specify conditions for the asymptotic optimality of the "universal 
threshold" $2 \log m$ of [11] and the closely related Bonferroni correction. We also 
provide conditions for the asymptotic optimality of fixed threshold multiple testing 
rules which control the Bayesian False Discovery Rate (BFDR) at a given level $\alpha$. 
It turns out that the optimal choice of $\alpha$ depends on the expected signal magnitude, 
$u$, and the ratio between the losses, $\delta$. Broadly speaking, $\alpha$ should decrease when $u$ 
or $\delta$ increases. Our results also show that for a wide range of choices of the sparsity 
level $p$ and expected signal magnitude $u$, a rule controlling the BFDR at a fixed 
level $\alpha$ is asymptotically optimal if the ratio between losses, $\delta$, converges to zero 
at a suitably slow rate. In the case where the sequence of sparsity levels satisfies 
$p_m \propto m^{-\beta}$ and signals are on the verge of detectability, our results take an especially 
simple form. Specifically, we prove that under this scenario a rule controlling the 
BFDR at a fixed FDR level $\alpha \in (0, 1)$ is asymptotically optimal for a wide range of 
loss ratios $\delta_m$ satisfying 

$$\delta_m \to 0 \quad \text{and} \quad \frac{\log \delta_m}{\log m} \to 0 \qquad (1.2)$$

(see Corollary 5.5). The assumption that $\delta_m \to 0$ as $p_m \to 0$ agrees with the 
intuition that the cost of missing a signal should be relatively large if the true 
number of signals is small. 

The final results of the paper are included in Section 6, where we prove some 
optimality properties of the Benjamini-Hochberg procedure. Here we assume that 
$m \to \infty$ and $p_m \to 0$ in such a way that $m p_m \to C_p \in (0, \infty]$. We distinguish 
between two cases. In the case where 

$$p_m > \frac{\log^{\rho} m}{m}, \quad \text{for some } \rho > 1, \qquad (1.3)$$

the proof can be based on a comparison with a fixed threshold BFDR control rule. 
Specifically, the approximation of the random threshold used in the BH procedure 
by the threshold of the BFDR control rule works at this level of sparsity. The 
assumption (1.3) is similar to the one used in [1] for proving the optimality of 
BH, though our results are proved in a substantially different asymptotic context. 
Furthermore, BH is also shown to be asymptotically optimal for the extremely sparse 
case, where $p_m = z_m/m$, such that $z_m$ converges to a finite positive constant or 
diverges to infinity in such a way that $\log z_m = o(\log m)$. In this situation the 
type I error component of the risk is bounded by invoking the results of [16] on the 
expected number of type I errors under BH, while the bound on the type II error 
component of the risk follows from a comparison of BH with the Bonferroni rule, 
which is asymptotically optimal for this range of $p_m$. 

Our results show that for a wide range of choices of the mixture parameters, 
the BH rule shares the asymptotic optimality properties of the BFDR control rules 
discussed above, and adapts very well to the unknown sparsity. Specifically, in 
Corollary 6.2 we show that for any sequence of sparsity levels $p_m$ satisfying 

$$a_1 m^{-1} < p_m < a_2 m^{-a_3}, \qquad (1.4)$$

for some constants $a_1 \in (0, \infty)$, $a_2 \in (0, \infty)$ and $a_3 \in (0, 1)$, 

and any sequence of loss ratios satisfying (1.2), a BH rule with a fixed FDR level 
$\alpha \in (0, 1)$ is asymptotically optimal for all signals on the verge of detectability (i.e. 
signals satisfying (1.1)). In comparison to [1], our general results give some hints on 
how the optimal FDR level should be chosen, depending on the expected magnitude 
of the signal and the ratio between the losses. As far as we know, this is the first 
thorough discussion of the decision theoretic optimality of the Benjamini-Hochberg 
procedure in the context of hypothesis testing. 

As already mentioned, we place great emphasis on discussing asymptotic op- 
timality for signals "on the verge of detectability". We believe that rules which 
perform well in this region could be used as a kind of "gold standard" when not 
much information about the magnitude of the signal is available. But our optimality 
results are of a more general nature and specify optimality conditions for the whole 
range of detectable signals. 

The outline of the paper is as follows. In Section 2 we define and discuss our 
model. In Section 3 we introduce the decision theoretic and asymptotic framework 
of the paper. We present the Bayes oracle, which minimizes the Bayes risk, and 
formulate the conditions under which the asymptotic power of this rule is larger 
than 0. We also provide a formula for the optimal Bayes risk. In Section 4 we give 
a definition of asymptotic optimality in terms of the Bayes risk and characterize 
the fixed threshold multiple testing rules which are asymptotically optimal. Section 
4 also contains two examples of asymptotically optimal rules, which are related 
to the "universal threshold" $2 \log m$ of [11]. In Section 5 we discuss the Bayesian 
False Discovery Rate and give conditions under which controlling the BFDR is 
asymptotically optimal. We also provide conditions for the asymptotic optimality of 
the Bonferroni correction and relate the asymptotic approximation of the Benjamini- 
Hochberg random threshold to the threshold of the BFDR control rule. Section 6 
contains results on the asymptotic optimality of the BH procedure, while Section 7 
contains a discussion and directions for further research. The majority of the proofs 
can be found in the Appendix. 

2 Statistical model 

In this section we introduce the normal scale mixture model, in the context of which 
we study multiple testing. We will explain below that this model, previously applied 
in [27] and can be used both for testing point null hypotheses, as well as for 
distinguishing a small number of relatively large signals from a multitude of very 
small effects. We believe that the latter case is much more realistic in large scale 
multiple testing applications, e.g. microarray studies. 
Suppose we have $m$ independent observations $X_1, \ldots, X_m$ and assume that each 
$X_i$ has a normal $N(\mu_i, \sigma_\epsilon^2)$ distribution. Here $\mu_i$ represents the effect under inves- 
tigation and $\sigma_\epsilon^2$ is the variance of the random noise (e.g. the measurement error). 
We assume that each $\mu_i$ is an independent random variable, with distribution de- 
termined by the value of the unobservable random variable $\nu_i$, which takes values 
0 and 1 with probabilities $1 - p$ and $p$ respectively, for some $p \in (0, 1)$. We denote 
by $H_{0i}$ the event that $\nu_i = 0$, while $H_{Ai}$ denotes the event $\nu_i = 1$. We will refer to 
these events as the null and alternative hypotheses. Under $H_{0i}$, $\mu_i$ is assumed to 
have a $N(0, \sigma_0^2)$ distribution (where $\sigma_0^2 \geq 0$), while under $H_{Ai}$ it is assumed to have 
a $N(0, \sigma_0^2 + \tau^2)$ distribution (where $\tau^2 > 0$). Hence, we are really modelling the $\mu_i$'s 
as iid rv's from the following mixture distribution: 

$$\mu_i \sim (1 - p) N(0, \sigma_0^2) + p N(0, \sigma_0^2 + \tau^2). \qquad (2.5)$$

This implies that the marginal distribution of $X_i$ is the scale mixture of normals, 
namely, 

$$X_i \sim (1 - p) N(0, \sigma^2) + p N(0, \sigma^2 + \tau^2), \qquad (2.6)$$

where $\sigma^2 = \sigma_\epsilon^2 + \sigma_0^2$. 
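
The two-stage sampling scheme above (first the hidden indicator, then the effect, then the noisy observation) can be sketched in a few lines of Python. This is only an illustrative simulation, not part of the original analysis: the function name `simulate_scale_mixture` and the argument names `noise_var` (noise variance), `null_var` (variance of effects under the null) and `tau2` (additional variance under the alternative) are our own labels.

```python
import math
import random


def simulate_scale_mixture(m, p, noise_var, null_var, tau2, seed=0):
    """Draw m observations from the normal scale mixture model.

    noise_var : variance of the measurement noise around each effect
    null_var  : variance of the effects under the null (0 gives a point null)
    tau2      : extra variance of the effects under the alternative
    Returns (labels, xs) with labels[i] = 1 when the i-th effect is a signal.
    """
    rng = random.Random(seed)
    labels, xs = [], []
    for _ in range(m):
        is_signal = rng.random() < p                 # hidden indicator of the alternative
        effect_var = null_var + (tau2 if is_signal else 0.0)
        mu = rng.gauss(0.0, math.sqrt(effect_var))   # effect drawn from its prior
        xs.append(rng.gauss(mu, math.sqrt(noise_var)))
        labels.append(int(is_signal))
    return labels, xs


labels, xs = simulate_scale_mixture(m=20000, p=0.1, noise_var=1.0, null_var=0.0, tau2=9.0)
signal_fraction = sum(labels) / len(labels)
sample_var = sum(x * x for x in xs) / len(xs)
```

With these illustrative parameters the marginal variance implied by (2.6) is $0.9 \cdot 1 + 0.1 \cdot 10 = 1.9$, so `sample_var` should land close to that value and `signal_fraction` close to 0.1.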

We will use the term "sparse mixture" to refer to the situation when $p$ is close to 0. 

Note that in the case where $\sigma_0 = 0$, $H_{0i}$ corresponds to the point null hypothesis 
that $\mu_i = 0$. Allowing $\sigma_0 > 0$ greatly extends the scope of the applications of the 
proposed mixture model under sparsity. In many multiple testing problems it seems 
unrealistic to assume that the vast majority of effects are exactly equal to zero. E.g., 
in the context of locating genes influencing quantitative traits, it is typically assumed 
that a trait is influenced by many genes with very small effects, so called polygenes. 
Such genes form a background, which can be modeled by the null component of the 
mixture. In this case the main purpose of statistical inference is the identification of 
a small number of significant "outliers", whose impact on the trait is substantially 
larger than that of the polygenes. These important "outlying" genes are modeled 
by the non-null component of the mixture. 

In the remaining part of the paper we will assume that the variance of $X_i$ under 
the null hypothesis, $\sigma^2$, is known. This assumption is often used in the literature 
on the asymptotic properties of multiple testing procedures (see e.g., [9] or [1]). 
However, in practical applications $\sigma^2$ is often unknown and needs to be estimated. 
In the case of a simple null hypothesis (i.e. when $\sigma_0 = 0$), $\sigma^2$ can be precisely 
estimated by using replicates of $X_i$. In the case where $\sigma_0 > 0$ the situation is more 
difficult, but $\sigma^2$ can still be estimated by pooling the information from all the test 
statistics and applying Empirical Bayes methods (see e.g., [1]). Some discussion on 
the issue of estimating the parameters in sparse mixtures is provided in Section 7. 

Remark 2.1 The proposed mixture model for $X_i$ is a specific example of the two- 
groups model, which was discussed in a wider nonparametric context e.g. in [T3], 
[13], [T8] and [1]. Restricting attention to normal mixtures allows us to reduce the 
technical complexity of the proofs and to concentrate on the main aspects of the 
problem. We believe that similar results also hold in a substantially more general 
setting, e.g. when the normal distribution is replaced by another suitable scale 
distribution, with a large scale parameter under the alternative. 



3 The Bayes oracle 

We consider a Bayesian decision theoretic formulation of the multiple testing prob- 
lem of testing $H_{0i}$ versus $H_{Ai}$, for each $i = 1, \ldots, m$ simultaneously. For each $i$, there 
are two possible "states of nature", namely $H_{0i}$ or $H_{Ai}$, that occur with probabilities 
$(1 - p)$ and $p$, respectively. As indicated in Section 2, under $H_{0i}$, $X_i \sim N(0, \sigma^2)$, 
while under $H_{Ai}$, $X_i \sim N(0, \sigma^2 + \tau^2)$. Table 1 defines the matrix of losses for making 
a decision in the $i$-th test. 



Table 1: Matrix of losses

                     Choose $H_{0i}$     Choose $H_{Ai}$
$H_{0i}$ true        0                   $\delta_0$
$H_{Ai}$ true        $\delta_A$          0



We assume that the overall loss in the multiple testing procedure is the sum of 
losses for individual tests. Thus our approach is based on the notion of an additive 
loss function, which goes back to [21] and [22], and seems to be implicit in most of 
the current formulations. 
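
As a small illustration of the additive loss, the sketch below sums the per-test losses for a vector of decisions. The function name `total_loss` and the 0/1 encoding of truths and decisions are our own conventions, chosen only for this example.

```python
def total_loss(truth, decisions, delta0, deltaA):
    """Additive loss: delta0 per false rejection, deltaA per missed signal.

    truth[i]     = 1 if the i-th alternative is true, else 0
    decisions[i] = 1 if the i-th null is rejected, else 0
    """
    loss = 0.0
    for is_signal, rejected in zip(truth, decisions):
        if rejected and not is_signal:
            loss += delta0      # type I error
        elif is_signal and not rejected:
            loss += deltaA      # type II error
    return loss


truth = [0, 0, 1, 1, 0]
decisions = [0, 1, 1, 0, 0]
unit_loss = total_loss(truth, decisions, delta0=1.0, deltaA=1.0)
weighted_loss = total_loss(truth, decisions, delta0=1.0, deltaA=5.0)
```

With unit losses the result is simply the number of misclassified hypotheses (here one false rejection and one missed signal), while a larger loss for type II errors penalizes the missed signal more heavily.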

Under an additive loss function, the compound Bayes decision problem can be 
solved as follows. It is easy to see that the expected value of the total loss is mini- 
mized by a procedure which simply applies the Bayesian classifier to each individual 
test. For each $i$, this leads to choosing the alternative hypothesis $H_{Ai}$ in cases such 
that 

$$\frac{\phi_A(X_i)}{\phi_0(X_i)} \geq \frac{(1 - p)\,\delta_0}{p\,\delta_A}, \qquad (3.7)$$

where $\phi_A$ and $\phi_0$ are the densities of $X_i$ under the alternative and null hypotheses, 
respectively. 

After substituting in the formulas for the appropriate normal densities, we obtain 
the optimal rule: 



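$$\text{Reject } H_{0i} \text{ if } \frac{X_i^2}{\sigma^2} \geq c^2, \qquad (3.8)$$

where 

$$c^2 = c^2_{\tau,\sigma,f,\delta} = \frac{\sigma^2 + \tau^2}{\tau^2}\left(\log\left(\left(\frac{\tau}{\sigma}\right)^2 + 1\right) + 2\log(f\delta)\right), \qquad (3.9)$$

with $f = \frac{1-p}{p}$ and $\delta = \frac{\delta_0}{\delta_A}$. We call this rule a Bayes oracle, since it makes use 
of the unknown parameters of the mixture and therefore is not attainable in finite 
samples. 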
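
The step from the likelihood-ratio form (3.7) to the threshold form (3.8)-(3.9) can be checked numerically. The sketch below is a minimal illustration under the assumption that the threshold reads $c^2 = \frac{\sigma^2+\tau^2}{\tau^2}\left(\log(\tau^2/\sigma^2 + 1) + 2\log(f\delta)\right)$ with $f = (1-p)/p$ and $\delta = \delta_0/\delta_A$; the function names are ours.

```python
import math


def oracle_threshold_sq(p, tau2, sigma2, delta0, deltaA):
    """Squared threshold c^2 of the Bayes oracle, on the scale X^2 / sigma^2."""
    f = (1.0 - p) / p
    delta = delta0 / deltaA
    return (sigma2 + tau2) / tau2 * (math.log(tau2 / sigma2 + 1.0) + 2.0 * math.log(f * delta))


def likelihood_ratio(x, tau2, sigma2):
    """Density of N(0, sigma2 + tau2) divided by density of N(0, sigma2) at x."""
    alt_var = sigma2 + tau2
    return math.sqrt(sigma2 / alt_var) * math.exp(0.5 * x * x * (1.0 / sigma2 - 1.0 / alt_var))


def oracle_rejects(x, p, tau2, sigma2, delta0, deltaA):
    """Apply the fixed-threshold form of the oracle to one observation."""
    return x * x / sigma2 >= oracle_threshold_sq(p, tau2, sigma2, delta0, deltaA)


# Illustrative parameter values (not taken from the paper).
p, tau2, sigma2, delta0, deltaA = 0.01, 25.0, 1.0, 1.0, 1.0
c2 = oracle_threshold_sq(p, tau2, sigma2, delta0, deltaA)
x_boundary = math.sqrt(sigma2 * c2)
lr_at_boundary = likelihood_ratio(x_boundary, tau2, sigma2)
cutoff = (1.0 - p) * delta0 / (p * deltaA)
```

At the boundary point `x_boundary` the likelihood ratio coincides with the cut-off $(1-p)\delta_0/(p\,\delta_A)$, which is exactly the statement that the two forms of the rule agree. Note that for parameter choices with $f\delta$ very small the computed $c^2$ can be negative, in which case the rule rejects every hypothesis.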

Remark 3.1 The Bayes oracle for multiple tests as defined above was introduced 
independently in and [1]. Two other oracles for multiple tests have recently been 
proposed in [31] and [36]. They are both based on the principles of classical statistics 
and aim to maximize the number of true discoveries, while keeping the expected 
number of false positives or false discovery rate at a given level. Interestingly, 
Sun and Cai [36] point out a relationship between their oracle and the Bayes oracle 
(3.7). In practical applications both of the methods proposed in [31] and [36] require 
estimation of the parameters of the mixture distribution. The asymptotic results 
given in [36] illustrate the optimality of the corresponding multiple testing procedure 
in the proposed classical context and for any fixed (though unknown) $p \in (0, 1)$. 

The oracle (3.7), considered in this manuscript, is motivated by traditional 
Bayesian decision theory and minimizes a weighted average of the misclassification 
errors of both types. We are interested in identifying multiple testing procedures 
that are asymptotically as good as this oracle. We consider the case where p tends 
to zero, which requires rather subtle methods. Specifically, under this scenario only 
relatively large signals have a chance of being detected by a Bayes oracle. The cor- 
responding assumption (A), specifying the range of detectable signals, is proposed 
in Section 3.1. In a frequentist setting, a similar scaling of the asymptotically de- 
tectable signals was introduced in [9]. Some extensions of the latter are obtained in 

m- 

Using standard notation from the theory of testing, we define the probability of 
a type I error as 

$$t_{1i} = P_{H_{0i}}(H_{0i} \text{ is rejected})$$

and the probability of a type II error as 

$$t_{2i} = P_{H_{Ai}}(H_{0i} \text{ is accepted}).$$

Note that under our mixture model the marginal distributions of $X_i$ under the 
null and alternative hypotheses do not depend on $i$ and the threshold of the Bayes 
oracle is also the same for each test. Hence, when calculating the probabilities of 
type I errors and type II errors for the Bayes oracle, we can, and will henceforth, 
suppress $i$ from $t_{1i}$ and $t_{2i}$. The same remark also applies to any fixed threshold 
procedure which, for each $i$, rejects $H_{0i}$ if $X_i^2/\sigma^2 \geq K$ for some constant $K$. 

In the remainder of this section we provide formulas for the probabilities of type 
I and type II errors using the Bayes oracle and calculate the corresponding Bayes 
risk. We also introduce the asymptotic framework used in this article. 

3.1. Type II errors and the asymptotic framework 

We now want to motivate the asymptotic framework which will be formally intro- 
duced below as Assumption (A). 

Let $\gamma = (p, \tau^2, \sigma^2, \delta_0, \delta_A)$ be the vector of parameters defining the Bayes oracle 
(3.9). In our asymptotic analysis, we will consider infinite sequences of such $\gamma$'s. 
A natural example of such a situation arises when the number of tests $m$ increases 
to infinity and the vector $\gamma$ varies with the number of tests $m$. But here we are 
actually trying to understand, in a unified manner, the general limiting problem 
when $\gamma$ varies through a sequence. 
The threshold (3.9) depends on $\tau$ and $\sigma$ only through $u = (\tau/\sigma)^2$. Note that $u$ is 
a natural scale for measuring the strength of the signal in terms of the variance of 
$X_i$ under the null. We also introduce another parameter $v = u f^2 \delta^2$, which can be 
used to simplify the formula for the optimal threshold 

$$c^2_{u,v} = \left(1 + \frac{1}{u}\right)\left(\log v + \log\left(1 + \frac{1}{u}\right)\right). \qquad (3.10)$$

Observe that under the alternative $X_i/\sigma$ has a normal $N(0, 1 + u)$ distribution. 
Thus the probability of a type II error using the Bayes oracle is given by 

$$t_2 = P\left(Z^2 < \frac{c^2_{u,v}}{u + 1}\right), \qquad (3.11)$$

where $Z$ is a standard normal variable. 
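
For readers who want the intermediate algebra, the passage from (3.9) to (3.10) and (3.11) can be sketched as follows. The derivation assumes the definitions $u = (\tau/\sigma)^2$, $f = (1-p)/p$, $\delta = \delta_0/\delta_A$ and $v = u f^2 \delta^2$, and the closed form in the last line additionally assumes $c^2 > 0$.

```latex
\begin{align*}
c^2 &= \frac{\sigma^2+\tau^2}{\tau^2}\Bigl(\log\Bigl(\frac{\tau^2}{\sigma^2}+1\Bigr) + 2\log(f\delta)\Bigr)
     = \Bigl(1+\frac{1}{u}\Bigr)\bigl(\log(u+1) + \log(f^2\delta^2)\bigr) \\
    &= \Bigl(1+\frac{1}{u}\Bigr)\Bigl(\log(u f^2\delta^2) + \log\Bigl(1+\frac{1}{u}\Bigr)\Bigr)
     = \Bigl(1+\frac{1}{u}\Bigr)\Bigl(\log v + \log\Bigl(1+\frac{1}{u}\Bigr)\Bigr), \\
t_2 &= P\Bigl(\frac{X_i^2}{\sigma^2} < c^2 \,\Big|\, H_{Ai}\Bigr)
     = P\bigl((1+u)\, Z^2 < c^2\bigr)
     = 2\Phi\Bigl(\sqrt{\frac{c^2}{u+1}}\Bigr) - 1 .
\end{align*}
```

The second equality in the first line uses $\log(u+1) = \log u + \log(1 + 1/u)$, which is what moves the factor $u$ into $v$.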

From (3.11) it follows that given an arbitrary infinite sequence of $\gamma$'s, the limiting 
power of the Bayes oracle is non-zero only if the corresponding sequence $\frac{c^2_{u,v}}{u+1}$ remains 
bounded. We will restrict ourselves to such sequences, since otherwise even the Bayes 
oracle cannot guarantee non-trivial inference in the limit and all rules will perform 
poorly. 

The focus of this paper is the study of the inference problem when $p \to 0$ 
and the goal is to find procedures which will efficiently identify signals under such 
circumstances. To clarify these ideas, consider the situation where $p \to 0$ and 
$\log \delta = o(\log p)$. It is immediately evident from (3.9) that in this situation $c^2 = c^2_{u,v}$ 
diverges to infinity. Hence $\frac{c^2_{u,v}}{u+1}$ remains bounded only when the signal magnitude $u$ 
diverges to infinity, in which case $\frac{c^2_{u,v}}{u+1} \sim \frac{\log v}{u}$. This explains two of the three asymp- 
totic conditions we impose below in Assumption (A). The third condition $v \to \infty$ 
pragmatically ensures that $\delta$ is not allowed to converge to zero too quickly. 

Assumption (A): A sequence of vectors $\{\gamma_t = (p_t, \tau_t^2, \sigma_t^2, \delta_{0t}, \delta_{At});\ t \in \{1, 2, \ldots\}\}$ 
satisfies this assumption if the corresponding sequence of parameter vectors, 
$\theta_t = (u_t, v_t)$, fulfills the following conditions: $u_t \to \infty$, $v_t \to \infty$ and 
$\frac{\log v_t}{u_t} \to C \in [0, \infty)$, as $t \to \infty$. 

Remark 3.2 While this article is mainly focused on the case $p \to 0$, the asymptotic 
results which follow can also be applied in other situations where Assumption (A) is 
satisfied. We do not allow $C = \infty$ in Assumption (A), because then using the Bayes 
oracle the limit of the probability of a type II error is equal to one and signals cannot 
be identified. For other values of $C$ the limiting power for the detection of signals is 
in (0, 1]. We call the corresponding parametric region detectable. If $C = 0$, then the 
oracle has a limiting power equal to one (see equation (3.13) below). As discussed 
in Section 4.1, such a situation can occur naturally if the number of replicates used 
to calculate $X_i$ increases to infinity as $p \to 0$. In the case where $C \in (0, \infty)$, the 
asymptotic power is smaller than one and we refer to the corresponding parametric 
region as "the verge of detectability". 
When $p \to 0$ and $\log \delta = o(\log p)$, Assumption (A) reduces to $\frac{u}{-\log p} \to C_u \in (0, \infty]$ 
and specifies the relationship between the magnitude $u$ of asymptotically 
detectable signals and the sparsity parameter $p$. Interestingly, in this case, sig- 
nals on the verge of detectability, $u \propto -\log p$, can be related to asymptotically 
least-favorable configurations for $\ell_0[p]$ balls (defined in Section 5 below) discussed 
in Section 3.1 of [1]. Ignoring constants, the typical magnitudes of observations 
corresponding to such signals will be similar to the threshold of the minimax hard 
thresholding estimator corresponding to the parameter space $\ell_0[p]$. 



Remark 3.3 A similar relationship between the sparsity parameter p and the signal 
magnitude can also be shown to be necessary for ensuring non-trivial inference 
for mixtures of other types of scale families, for example the gamma family or a 
generalization of the double exponential, namely $f(x) = \exp(-|x|^a)$, $a > 0$. 

Notation: We will usually suppress the index $t$ of the elements of the vectors $\gamma_t$ 
and $\theta_t$. Unless otherwise stated, throughout the paper the notation $o_t$ will denote 
an infinite sequence of terms indexed by $t$, which go to zero when $t \to \infty$. In many 
cases $t$ is the same as the number of tests $m$ and in such cases the notation $o_t$ will 
be replaced by $o_m$. 



Lemma 3.1 Under Assumption (A) the probability of a type II error using the 
Bayes oracle is given by the following equations: 

$$t_2 = \left(2\Phi(\sqrt{C}) - 1\right)(1 + o_t), \qquad (3.12)$$

when $C \in (0, \infty)$, and 

$$t_2 = \sqrt{\frac{2 \log v}{\pi u}}\,(1 + o_t), \qquad (3.13)$$

when $C = 0$. 

Proof. Lemma 3.1 easily follows from (3.11) and Assumption (A). □ 
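
The two regimes of Lemma 3.1 can be compared with the exact type II error for large parameter values. The sketch below is a numerical illustration only; it assumes the $(u, v)$ form of the threshold, $c^2 = (1 + 1/u)(\log v + \log(1 + 1/u))$, takes $\log v$ as input to avoid overflow, and uses function names of our own choosing.

```python
import math


def threshold_sq(u, log_v):
    """Oracle threshold in the (u, v) parametrization, taking log(v) as input."""
    return (1.0 + 1.0 / u) * (log_v + math.log1p(1.0 / u))


def type2_exact(u, log_v):
    """P(Z^2 < c^2 / (u + 1)) for a standard normal Z, written with erf."""
    x = threshold_sq(u, log_v) / (u + 1.0)
    return math.erf(math.sqrt(x / 2.0))


def type2_limit_verge(C):
    """Leading term 2 * Phi(sqrt(C)) - 1 for C in (0, infinity)."""
    return math.erf(math.sqrt(C / 2.0))


def type2_limit_strong(u, log_v):
    """Leading term sqrt(2 log v / (pi u)) for the case C = 0."""
    return math.sqrt(2.0 * log_v / (math.pi * u))


# Verge of detectability: log(v) / u = 1, so C = 1.
exact_verge = type2_exact(1.0e6, 1.0e6)
limit_verge = type2_limit_verge(1.0)

# Strong signals: log(v) / u = 1e-4, close to the C = 0 regime.
exact_strong = type2_exact(1.0e6, 100.0)
limit_strong = type2_limit_strong(1.0e6, 100.0)
```

The identity $2\Phi(\sqrt{x}) - 1 = \operatorname{erf}(\sqrt{x/2})$ is used throughout, which avoids the cancellation in $2\Phi(\cdot) - 1$ when the argument is small.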



3.2. Type I errors 

Lemma 3.2 Under Assumption (A), the probability of a type I error using the 
Bayes oracle is given by 

$$t_1 = e^{-C/2}\sqrt{\frac{2}{\pi v \log v}}\,(1 + o_t). \qquad (3.14)$$

Proof. Note that $t_1 = P(|Z| > c_{u,v})$. Moreover, 

$$c^2_{u,v} = \log v\,(1 + z_{u,v}), \qquad (3.15)$$

where $\lim_{u \to \infty, v \to \infty} z_{u,v}\,u = 1$. Therefore, we obtain 

$$\phi(c_{u,v})\sqrt{2\pi v} = \exp\left(-\frac{z_{u,v} \log v}{2}\right),$$

where $\phi$ is the density of the standard normal distribution. This, together with 
Assumption (A), yields 

$$\phi(c_{u,v}) = \frac{e^{-C/2}}{\sqrt{2\pi v}}\,(1 + o_t). \qquad (3.16)$$

Now the proof follows easily by invoking the well known approximation to the 
tail probability of the standard normal distribution 

$$P(|Z| > c) = \frac{2\phi(c)}{c}\,(1 - z_1(c)), \qquad (3.17)$$

where $z_1(c)$ is a positive function such that $z_1(c)\,c^2 = O(1)$ as $c \to \infty$. □ 
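
The tail approximation used in this proof can be checked numerically. The sketch below assumes that the leading term of the type I error is $e^{-C/2}\sqrt{2/(\pi v \log v)}$, which is what combining the normal tail approximation $P(|Z| > c) \approx 2\phi(c)/c$ with $c^2 \approx \log v$ gives, and it replaces $C$ by its finite-parameter value $\log v / u$. Function names are illustrative.

```python
import math


def threshold_sq(u, log_v):
    """Oracle threshold in the (u, v) parametrization, taking log(v) as input."""
    return (1.0 + 1.0 / u) * (log_v + math.log1p(1.0 / u))


def type1_exact(u, log_v):
    """P(|Z| > c) for the oracle threshold c; erfc keeps precision far in the tail."""
    c = math.sqrt(threshold_sq(u, log_v))
    return math.erfc(c / math.sqrt(2.0))


def type1_leading_term(u, log_v):
    """exp(-C/2) * sqrt(2 / (pi * v * log v)), with C taken as log(v) / u."""
    C = log_v / u
    return math.exp(-0.5 * C - 0.5 * log_v) * math.sqrt(2.0 / (math.pi * log_v))


# Ratio exact / leading term for increasing log(v) at a fixed large u.
ratios = [type1_exact(1.0e4, lv) / type1_leading_term(1.0e4, lv) for lv in (20.0, 50.0, 200.0)]
```

The ratios stay slightly below one and move towards one as $\log v$ grows, reflecting the correction factor $1 - z_1(c)$ with $z_1(c) = O(c^{-2})$.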



3.3. The Bayes risk 

Under an additive loss function, the Bayes risk for a multiple testing procedure is 
given by 

$$R = \sum_{i=1}^{m} \left\{(1 - p)\,t_{1i}\,\delta_0 + p\,t_{2i}\,\delta_A\right\}. \qquad (3.18)$$

In particular, the Bayes risk for a fixed threshold multiple testing procedure is given 
by 

$$R = m\left((1 - p)\,t_1\,\delta_0 + p\,t_2\,\delta_A\right). \qquad (3.19)$$

Equations (3.12), (3.13) and (3.14) easily yield the following asymptotic approx- 
imation to the optimal Bayes risk. 

Theorem 3.1 Under Assumption (A), using the Bayes oracle the risk takes the 
form 

$$R_{opt} = m p\,\delta_A \sqrt{\frac{2 \log v}{\pi u}}\,(1 + o_t), \qquad (3.20)$$

when $C = 0$, or 

$$R_{opt} = m p\,\delta_A \left(2\Phi(\sqrt{C}) - 1\right)(1 + o_t), \qquad (3.21)$$

when $0 < C < \infty$. 
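
Formula (3.19) is easy to evaluate for any fixed threshold, which gives a direct check that the oracle threshold minimizes the risk within this class. The following is a minimal sketch with illustrative parameter values; it assumes the rejection rule $X_i^2/\sigma^2 \geq K$ and the $(u, v)$ form of the oracle threshold, and the function names are ours.

```python
import math


def fixed_threshold_risk(K, m, p, u, delta0, deltaA):
    """Bayes risk (3.19) of the rule rejecting when X_i^2 / sigma^2 >= K."""
    t1 = math.erfc(math.sqrt(K / 2.0))               # P(Z^2 >= K) under the null
    t2 = math.erf(math.sqrt(K / (2.0 * (1.0 + u))))  # P((1 + u) Z^2 < K) under the alternative
    return m * ((1.0 - p) * t1 * delta0 + p * t2 * deltaA)


def oracle_threshold_sq(p, u, delta0, deltaA):
    """Oracle threshold c^2 written in terms of u and v = u * (f * delta)^2."""
    f = (1.0 - p) / p
    v = u * (f * delta0 / deltaA) ** 2
    return (1.0 + 1.0 / u) * (math.log(v) + math.log1p(1.0 / u))


m, p, u, delta0, deltaA = 10000, 0.01, 30.0, 1.0, 1.0
c2 = oracle_threshold_sq(p, u, delta0, deltaA)
risk_at_oracle = fixed_threshold_risk(c2, m, p, u, delta0, deltaA)
nearby = [fixed_threshold_risk(c2 + shift, m, p, u, delta0, deltaA)
          for shift in (-2.0, -0.5, 0.5, 2.0)]
```

Every shifted threshold in `nearby` yields a strictly larger risk than `risk_at_oracle`, as expected from the fact that the oracle is the Bayes rule.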
Remark 3.4 It is important to note that under Assumption (A), the asymptotic 
form of the risk under the Bayes oracle $R_{opt}$ is determined by the component of risk 
corresponding to type II errors, $m p\,t_2\,\delta_A$. This is due to the fact that the probability 
of a type I error is much more sensitive to a change in the threshold than the 
probability of a type II error. Specifically, it is easy to check that a "slight" decrease 
in the threshold leads to an increase in the rate of convergence of the component 
of risk corresponding to type I errors such that it equals the rate of convergence of 
Ropt, without affecting the rate and the constant corresponding to the type II error 
component. Thus the risk of the resulting "balanced" rule would have the same rate 
of convergence as the Bayes oracle, but with a larger constant of proportionality. 

4 Asymptotically optimal rules 

In this section we formally define the asymptotic optimality of multiple testing rules 
and then characterize the class of asymptotically optimal rules with fixed thresholds. 

Consider a sequence of parameter vectors $\gamma_t$, satisfying Assumption (A). 
Definition. We call a multiple testing rule asymptotically optimal for $\gamma_t$ if its risk 
$R$ satisfies 

$$\frac{R}{R_{opt}} \to 1 \quad \text{as } t \to \infty,$$

where $R_{opt}$ is the optimal risk, given by Theorem 3.1. 

Remark 4.1 This definition relates optimality to a particular sequence of $\gamma$ vectors 
satisfying Assumption (A). However, the asymptotically optimal rule for a specific 
sequence $\gamma_t$ is also typically optimal for a large set of "similar" sequences. The 
asymptotic results presented in the following sections of this paper characterize 
these "domains" of optimality for some of the popularly used multiple testing rules. 
Since Assumption (A) is an inherent part of our definition of optimality, we will 
refrain from explicitly stating it when reporting our asymptotic optimality results. 

The following theorem fully characterizes the set of asymptotically optimal mul- 
tiple testing rules with fixed thresholds. 

Theorem 4.1 A multiple testing rule of the form (3.8) with threshold $c^2 = c_t^2 = \log v + z_t$ 
is asymptotically optimal if and only if 

$$z_t = o(\log v) \qquad (4.22)$$

and 

$$z_t + 2 \log \log v \to \infty. \qquad (4.23)$$

The proof of Theorem 4.1 is given in Appendix 8.1. 
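
To get a feeling for the two conditions, one can compute the ratio of risks for thresholds of the form $\log v + z$ at fixed, finite parameter values. The sketch below is only a finite-sample illustration and proves nothing about limits; it assumes conditions of the form $z_t = o(\log v)$ and $z_t + 2\log\log v \to \infty$, uses the identity $(1-p)\delta_0/(p\,\delta_A) = f\delta = \sqrt{v/u}$, and all function names are ours.

```python
import math


def risk_over_mp_deltaA(K, u, log_v):
    """Risk (3.19) divided by m * p * delta_A for the rule X^2 / sigma^2 >= K."""
    t1 = math.erfc(math.sqrt(K / 2.0))
    t2 = math.erf(math.sqrt(K / (2.0 * (1.0 + u))))
    # (1 - p) * delta_0 / (p * delta_A) equals sqrt(v / u).
    return math.exp(0.5 * log_v) / math.sqrt(u) * t1 + t2


def risk_ratio(z, u, log_v):
    """Risk of the threshold log(v) + z relative to the Bayes oracle."""
    oracle_K = (1.0 + 1.0 / u) * (log_v + math.log1p(1.0 / u))
    return risk_over_mp_deltaA(log_v + z, u, log_v) / risk_over_mp_deltaA(oracle_K, u, log_v)


u, log_v = 50.0, 50.0                                     # log(v) / u = 1
ratio_z0 = risk_ratio(0.0, u, log_v)                      # z = 0 is compatible with both conditions
ratio_low = risk_ratio(-3.0 * math.log(log_v), u, log_v)  # z + 2 log log v is negative
ratio_high = risk_ratio(0.9 * log_v, u, log_v)            # z is of the same order as log v
```

For these values the threshold $\log v$ itself is nearly as good as the oracle, while a threshold that is too low inflates the type I error component and one that is too high inflates the type II error component.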
Remark 4.2 Conditions (4.22) and (4.23) guarantee the asymptotic optimality of 
the components of risk corresponding to type II and type I errors, respectively. 

Remark 4.3 We have observed that the Bayes oracle is a fixed threshold test. So 
it is natural that an optimal multiple testing procedure will be of this kind or will 
behave (at least asymptotically) like a fixed threshold test, with the threshold de- 
pending on the unknown parameters. So a study of the optimality of fixed threshold 
tests may give important clues about the optimality of more general tests. In Section 
6 this is shown to be true in the context of proving the optimality of the popular 
Benjamini-Hochberg [2] procedure. 

4.1. Examples 

Here we present two multiple testing rules, which are asymptotically optimal when $p_m \propto 1/m$. Both rules are closely related to the universal threshold $2\log m$ of [11], which, according to [9] and [12], has some optimality properties under sparsity. [9] and [12] consider a range of sparsity given by $p_m \propto m^{-\beta}$, with $\beta < 1$. Here we consider more extreme sparsity, $\beta = 1$, and prove the asymptotic optimality of a universal threshold for signals at the verge of detectability. The second of the rules considered is a modification of the universal threshold, which is asymptotically optimal when each of the tests is based on $n$ replicates.

Lemma 4.1 Assume that $\delta$ = constant, $m \to \infty$ and $m p_m \to s$, where $0 < s < \infty$. Then the multiple testing rule (3.8) based on the threshold

$$c^2 = c_m^2 = 2\log m + d , \tag{4.24}$$

where $d \in \mathbb{R}$, is asymptotically optimal for signals on the verge of detectability, $u = \beta_1 \log m\,(1 + o_m)$, with $\beta_1 \in (0, \infty)$.

Lemma 4.2 Let $\sigma_0 = 0$. Assume that each test statistic $X_i$ is based on $n = n_m$ replicates and $\sigma^2 = \frac{\sigma_1^2}{n}(1 + o_m)$, where $\sigma_1^2$ represents the variance of $X_i$ for one replicate. Moreover, assume that $m \to \infty$, $\tau$ = constant, $\delta$ = constant, $m p_m \to s \in (0, \infty)$ and $\frac{\log m}{n} \to s_1 \in [0, \infty)$. The multiple testing rule (3.8) based on the threshold

$$c^2 = c_{m,n}^2 = \log n + 2\log m + d , \tag{4.25}$$

with $d \in \mathbb{R}$, is asymptotically optimal.

The main difference between the rules defined in (4.24) and (4.25) is the different ranges of scaled signal magnitudes $u$ for which they are optimal. The rules proposed in Lemma 4.1 are asymptotically optimal for the smallest detectable signals, which are of the order of $\log m$. On the other hand, rules of the form (4.25) are asymptotically optimal when $u = (\tau/\sigma)^2$ is proportional to $n$, which can be of a substantially larger order than $\log m$ (since $s_1$ can be equal to 0). Note however that such a situation can only occur if $\sigma_0 = 0$ (i.e. when we test $H_{0i} : \mu_i = 0$). If $\sigma_0 > 0$ then the variance of the test statistic under the null hypothesis, $\sigma^2$, does not converge to 0 when $n \to \infty$ and $u$ is bounded from above by $(\tau/\sigma_0)^2$. Thus the rule (4.25) is not recommended for the detection of outlying signals from the background noise.
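
As a rough numerical companion to Lemma 4.1, the sketch below computes the ratio between the Bayes risk of the universal-threshold rule $c^2 = 2\log m + d$ and that of the Bayes oracle when $p_m = s/m$ and $u = \beta_1 \log m$. The error-probability formulas, the oracle threshold and all names and default values are assumptions made for this illustration; the lemma itself is an asymptotic statement and the code only shows finite-$m$ behavior.

```python
import math

def norm_sf(x):
    """Upper tail 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def risk_per_test(c_sq, p, u, delta):
    """Per-test Bayes risk (in units of delta_A) for the rule X^2/sigma^2 >= c_sq."""
    c = math.sqrt(c_sq)
    t1 = 2.0 * norm_sf(c)
    t2 = 1.0 - 2.0 * norm_sf(c / math.sqrt(u + 1.0))
    return delta * (1.0 - p) * t1 + p * t2

def universal_vs_oracle(m, s=1.0, beta1=1.0, d=0.0, delta=1.0):
    """Risk of the threshold 2 log m + d divided by the oracle risk."""
    p = s / m                      # extreme sparsity, p_m proportional to 1/m
    u = beta1 * math.log(m)        # signals on the verge of detectability
    f = (1.0 - p) / p
    v = u * f**2 * delta**2
    # Oracle threshold; this closed form is an assumption of the sketch.
    c_oracle = (1.0 + 1.0 / u) * (math.log(v) + math.log(1.0 + 1.0 / u))
    c_univ = 2.0 * math.log(m) + d
    return risk_per_test(c_univ, p, u, delta) / risk_per_test(c_oracle, p, u, delta)

if __name__ == "__main__":
    for m in (10**3, 10**6, 10**9):
        print(m, universal_vs_oracle(m))
```

Because the difference between the two thresholds grows only like $\log\log m$, any convergence of the ratio to one should be expected to be slow.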
5 Controlling the Bayesian False Discovery Rate

In the previous section we described two rules, which are asymptotically optimal 
when the expected number of signals converges to a finite constant and hence re- 
mains bounded as the number of tests increases. However, this assumption is often 
unrealistic. In many applications the main reason for performing a large number 
of statistical tests is the belief that increasing the number of tests will enable the 
detection of a larger number of true signals. 

In this context, we refer to a recent paper [1], where it has been shown that the well known Benjamini-Hochberg procedure (BH, [2]), originally proposed in [29] and later in [31], can be used to estimate a sparse vector of means, where the level of sparsity can vary considerably. In [1], independent normal observations $X_i$, $i = 1, \ldots, m$, with unknown means $\mu_i$ and known variance are considered. Among the studied parameter spaces are $\ell_0[p_m]$ balls, which consist of those real $m$-vectors for which the fraction of non-zero elements is at most $p_m$. A data-adaptive thresholding estimator for the unknown vector of means is proposed using the Benjamini-Hochberg rule controlling the FDR at $\alpha_m > \beta_1/\log m$ for each $m > 1$ and some constant $\beta_1 > 1$. If the FDR control level converges to $\alpha_\infty \in [0, 1/2]$, this estimator is shown to be asymptotically minimax for a large class of loss functions (and in fact for many different types of sparsity classes including $\ell_p$ balls), as long as $p_m$ is in the range $[\log^5 m / m,\; m^{-\beta_2}]$, with $\beta_2 \in (0, 1)$.
Here we want to use the framework presented in Sections 3 and 4 to investigate 
the asymptotic optimality of BH for a broad range of sparsity levels by studying its 
Bayes risk with respect to an additive loss function. We reemphasize that minimiz- 
ing Bayes risk with respect to such a loss function seems to be a natural optimality 
criterion in the context of testing, where the main goal is to correctly detect sig- 
nals, rather than estimating their magnitude. However, it is not easy to show the 
optimality of BH directly, because it is a random thresholding rule (see Section 6). 
On the other hand, it was proved by Genovese and Wasserman (GW) in [17] that when $p$ remains fixed, as the number of tests increases, this random threshold can be approximated by a non-random one (defined in equation (5.54) below).

When $p \to 0$, the approximate threshold of [17] is basically the same as that of a fixed threshold rule controlling the Bayesian False Discovery Rate (BFDR, [13], defined below) at the same level. If a BFDR control rule is optimal under sparsity, the same can be expected of the corresponding rule using the threshold of [17]. The optimality of BH in turn may be proved by showing that, even under sparsity, GW thresholds of the form (5.54) are tight estimates of the BH random threshold. In Section 6 we will actually see that this can be done for a broad class of sparsity levels.

In the present section we first recall the definition of the Bayesian False Discovery Rate (BFDR) and then briefly motivate why one might expect that controlling the BFDR leads to an optimal rule. We provide general necessary and sufficient conditions under which fixed threshold rules controlling the BFDR at level $\alpha$ will be asymptotically optimal in terms of the Bayes risk. We then show that under sparsity the same conditions ensure the optimality of a threshold rule using the GW threshold. As a simple consequence of our general results, we finally show in Section 5.6 that, in addition, rules based on the Bonferroni correction are asymptotically optimal in the extremely sparse case.



5.1. The False Discovery Rate and Bayesian False Discovery 
Rate 

In a seminal paper [2], Benjamini and Hochberg introduced the False Discovery Rate 
(FDR) as a measure of the accuracy of a multiple testing procedure: 

$$\mathrm{FDR} = E\left(\frac{V}{R}\right) . \tag{5.26}$$

Here $R$ is the total number of null hypotheses rejected, $V$ is the number of "false" rejections and it is assumed that $V/R = 0$ when $R = 0$. For tests with a fixed threshold, Efron and Tibshirani [13] define another very similar measure, called the Bayesian False Discovery Rate, BFDR:

$$\mathrm{BFDR} = P(H_{0i} \text{ is true} \mid H_{0i} \text{ was rejected}) = \frac{(1-p)\,t_1}{(1-p)\,t_1 + p\,(1-t_2)} , \tag{5.27}$$

where $t_1$ and $t_2$ are the probabilities of type I and type II errors.

Note here that in our context it is enough to consider threshold tests that reject for high values of $X_i^2/\sigma^2$. This is due to the fact that from the MLR property and the Neyman-Pearson Lemma, it can be easily proved that any other kind of test with the same type I error will have a larger BFDR and Bayesian False Negative Rate (BFNR).

Extensive simulation studies and theoretical calculations in [17] and [3] illustrate that multiple testing rules controlling the BFDR at a small level $\alpha \approx 0.05$ behave very well under sparsity in terms of minimizing the misclassification error (i.e. the Bayes risk for $\delta_0 = \delta_A$). We also recall in this context that a test has BFDR $\alpha$ if and only if

$$(1-\alpha)(1-p)\,t_1 + \alpha\,p\,t_2 = \alpha\,p , \tag{5.28}$$

the l.h.s. of (5.28) being the Bayes risk for $\delta_0 = 1 - \alpha$ and $\delta_A = \alpha$. So the definition of the BFDR itself has a strong connection to the Bayes risk and a "proper" choice of $\alpha$ might actually yield an optimal rule (for similar conclusions see e.g. [2S]). To support this statement, a lemma in Appendix 8.2 shows that under the mixture model (2.6), the BFDR of a test based on the threshold $c$ continuously decreases from $(1 - p)$ for $c = 0$ to 0 for $c \to \infty$. In other words, there exists a 1-1 mapping between thresholds $c \in [0, \infty)$ and BFDR levels $\alpha \in (0, 1 - p]$. So, if the BFDR control level is chosen properly, the corresponding threshold can satisfy the conditions of Theorem 4.1. Naturally, such "optimal" BFDR control levels must be sufficiently similar to the BFDR of the Bayes oracle.
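
The one-to-one mapping between thresholds and BFDR levels, and the identity (5.28), can be checked numerically. The sketch below is a minimal illustration, assuming that under the mixture model the error probabilities are $t_1 = 2(1-\Phi(c))$ and $t_2 = 2\Phi(c/\sqrt{u+1}) - 1$; the helper names, the bisection bracket and the parameter values are hypothetical choices.

```python
import math

def norm_sf(x):
    """Upper tail 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def error_rates(c, u):
    """Type I and type II error probabilities of the rule |X|/sigma >= c."""
    t1 = 2.0 * norm_sf(c)
    t2 = 1.0 - 2.0 * norm_sf(c / math.sqrt(u + 1.0))
    return t1, t2

def bfdr(c, p, u):
    """BFDR of a fixed threshold test, following (5.27)."""
    t1, t2 = error_rates(c, u)
    return (1.0 - p) * t1 / ((1.0 - p) * t1 + p * (1.0 - t2))

def threshold_for_bfdr(alpha, p, u, hi=30.0, iters=200):
    """Invert the decreasing map c -> BFDR(c) by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bfdr(mid, p, u) > alpha:
            lo = mid   # BFDR still too large, so the threshold must grow
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    p, u, alpha = 0.01, 20.0, 0.05  # illustrative values only
    c = threshold_for_bfdr(alpha, p, u)
    t1, t2 = error_rates(c, u)
    print("threshold c:", c)
    print("lhs of (5.28):", (1 - alpha) * (1 - p) * t1 + alpha * p * t2)
    print("rhs of (5.28):", alpha * p)
```

Bisection is used only because the map is monotone; any root finder would serve equally well.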
Keeping the above in mind, in the next two sub-sections we explore the relationship between BFDR control rules and the Bayes oracle. Specifically, we calculate the BFDR of the Bayes oracle and specify the conditions under which BFDR control rules are asymptotically optimal.



5.2. BFDR of the Bayes oracle 

Before going to the mathematical derivations, we first observe some simple facts that indicate the form of dependency of the BFDR of the Bayes oracle on the ratio of losses $\delta = \delta_0/\delta_A$. Suppose the BFDR of the Bayes oracle is $\alpha$. Using definition (5.27), we can easily see that the corresponding ratio between the type I and type II components of the risk satisfies the following relationship:

$$\frac{\delta_0 (1-p)\, t_1}{\delta_A\, p\, t_2} = \delta \left(\frac{\alpha}{1-\alpha}\right) \frac{1 - t_2}{t_2} . \tag{5.29}$$

We recall that under Assumption (A), the l.h.s. of (5.29) converges to zero. Since $t_2$ is strictly smaller than 1, this can only happen when $\delta\left(\frac{\alpha}{1-\alpha}\right)$ (and hence $\delta\alpha$) converges to zero. This implies that if $\delta$ remains fixed, then $\alpha$ tends to zero. Also, the Bayes oracle can have a constant or a non-zero limiting BFDR only if $\delta$ converges to 0.

Let us define $t_{u,v,\delta} = \delta\sqrt{u \log v}$ and denote the BFDR of the Bayes oracle by $\mathrm{BFDR}_{BO}$. Lemma 5.1 and the remark below show the exact dependence of $\mathrm{BFDR}_{BO}$ on $\delta$ and $u$.

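Lemma 5.1 Suppose Assumption (A) holds. If $t_{u,v,\delta} \to \infty$, then $\mathrm{BFDR}_{BO}$ converges to zero at a rate specified by the formula:

$$\mathrm{BFDR}_{BO} = \sqrt{\frac{2}{\pi}}\, \frac{e^{-C/2}}{D\, t_{u,v,\delta}}\, (1 + o_t) , \tag{5.30}$$

where $D = 2(1 - \Phi(\sqrt{C}))$ is the asymptotic power.

If $t_{u,v,\delta} \to C_1$, where $0 < C_1 < \infty$, then $\mathrm{BFDR}_{BO}$ converges to a constant and is given by

$$\mathrm{BFDR}_{BO} = \frac{\sqrt{2/\pi}\, e^{-C/2}}{\sqrt{2/\pi}\, e^{-C/2} + D\, C_1}\, (1 + o_t) . \tag{5.31}$$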

Proof. Note that

$$\mathrm{BFDR} = \frac{f\, t_1}{f\, t_1 + (1 - t_2)} , \tag{5.32}$$

where $f = \frac{1-p}{p}$. The lemma follows easily by observing that (3.14) yields

$$f\, t_1 = \sqrt{\frac{2}{\pi}}\, \frac{e^{-C/2}}{t_{u,v,\delta}}\, (1 + o_t) , \tag{5.33}$$

while (3.13) and (3.12) give

$$1 - t_2 = 2(1 - \Phi(\sqrt{C})) + o_t . \tag{5.34}$$

□
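
The step from (5.32)-(5.34) to the two limits in Lemma 5.1 can be written out explicitly. The derivation below is a sketch that assumes the Bayes oracle type I error has the asymptotic form $t_1 = \sqrt{2/\pi}\, e^{-C/2} (v \log v)^{-1/2}(1+o_t)$, and it uses the shorthand $A = \sqrt{2/\pi}\, e^{-C/2}$, which is our notation for this illustration only.

```latex
\begin{align*}
f\,t_1 &= f\,\frac{A}{\sqrt{v\log v}}\,(1+o_t)
        = \frac{A}{\delta\sqrt{u\log v}}\,(1+o_t)
        = \frac{A}{t_{u,v,\delta}}\,(1+o_t),
        \qquad \text{since } \sqrt{v} = f\,\delta\sqrt{u}, \\[4pt]
\mathrm{BFDR}_{BO}
       &= \frac{f\,t_1}{f\,t_1 + (1-t_2)}
        = \frac{A\,(1+o_t)}{A\,(1+o_t) + t_{u,v,\delta}\,(D+o_t)} .
\end{align*}
```

If $t_{u,v,\delta} \to \infty$, the second term in the denominator dominates and the ratio behaves like $A/(D\,t_{u,v,\delta})$; if $t_{u,v,\delta} \to C_1 \in (0,\infty)$, the ratio converges to $A/(A + D\,C_1)$.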



Remark 5.1 From equations (5.32), (5.33) and (5.34), it is clear that the BFDR of the Bayes oracle is essentially a decreasing function of $t_{u,v,\delta} = (\delta\sqrt{u})^{1+b(v)} f^{b(v)}$, where

$$b(v) = \frac{\log\log v}{\log v} .$$

This effectively says that the BFDR of the Bayes oracle decreases when both $\delta$ and $u$ increase (since $b(v) \approx 0$ for large $v$, even large variation in the level of sparsity ($f$) will not typically alter this fact). In particular, Lemma 5.1 shows that under Assumption (A) and for $\delta$ = constant, the BFDR of the Bayes oracle converges to zero at the rate $1/\sqrt{u \log v}$. Specifically, in the case when $p \to 0$ and $u = -c\log p$ ($c > 0$) (i.e. on the verge of detectability) the BFDR of the Bayes oracle converges to zero at the rate $(-\log p)^{-1}$. The Bayes oracle has a non-zero limiting BFDR only if the ratio of losses converges to 0 at such a rate that $\delta^2 u \log v \to C_1$, where $C_1 < \infty$. This condition requires that the relative loss for type II errors increases to infinity slightly quicker than $u$ (i.e. there is a higher penalty for missed signals when they are sparse and relatively large). Specifically, for signals at the verge of detectability, $u = -c\log p$, the Bayes oracle has a fixed limiting BFDR if $\delta$ is of the order $(-\log p)^{-1}$.

5.3. Asymptotic optimality of BFDR control rules

In Section 5.2 we computed the BFDR of the Bayes oracle. These results give some indication of how the BFDR level $\alpha$ should be chosen to obtain control rules that are asymptotically optimal in terms of the Bayes risk. In this section we give a full characterization of asymptotically optimal BFDR levels. There is some flexibility in the choice of $\alpha$, although the general behavior, as expected, is closely related to the behavior of the BFDR of the Bayes oracle under similar circumstances. The general Theorem 5.1 below gives conditions on $\alpha$ which guarantee optimality for any given sequence of parameters $\gamma_t$ satisfying Assumption (A). In Section 5.4 we present some clearer results, which characterize BFDR control rules which are asymptotically optimal on the verge of detectability and in its close neighborhood. Specifically, Corollaries 5.5 and 5.6 give simple conditions on $\alpha$ and $\delta$, which make these parameters only dependent on the number of tests $m$.

Consider a fixed threshold rule based on controlling the BFDR at the level $\alpha$. Under the mixture model (2.6), a corresponding threshold value $c_B^2$ can be obtained by solving the equation

$$\frac{(1-p)(1-\Phi(c_B))}{(1-p)(1-\Phi(c_B)) + p\left(1-\Phi\left(c_B/\sqrt{u+1}\right)\right)} = \alpha \tag{5.35}$$

or equivalently, by solving

$$\frac{1-\Phi(c_B)}{1-\Phi\left(c_B/\sqrt{u+1}\right)} = \frac{r_\alpha}{f} , \tag{5.36}$$

where

$$r_\alpha = \frac{\alpha}{1-\alpha} . \tag{5.37}$$

Note that $r_\alpha$ converges to 0 when $\alpha \to 0$ and to infinity when $\alpha \to 1$.

Using Theorem 4.1 one can show that this test is asymptotically optimal only if $c_B/\sqrt{u+1}$ converges to $\sqrt{C}$, where $C$ is the constant in Assumption (A). From (5.36), this in turn implies that a BFDR control rule for a chosen $\alpha$ sequence can only be optimal if $r_\alpha/f$ goes to zero while satisfying certain conditions. When $r_\alpha/f \to 0$, a convenient asymptotic expansion for $c_B^2$ can be obtained and optimality holds if and only if this asymptotic form conforms to the conditions specified in Theorem 4.1. The following theorem gives the asymptotic expansion for $c_B^2$ and specifies the range of "optimal" choices of $r_\alpha$.

Theorem 5.1 Consider a rule controlling the BFDR at level $\alpha = \alpha_t$. Define $s_t$ by

$$\frac{\log(f\delta\sqrt{u})}{\log(f/r_\alpha)} = 1 + s_t . \tag{5.38}$$

Then the rule is asymptotically optimal if and only if

$$s_t \to 0 \tag{5.39}$$

and

$$2 s_t \log(f/r_\alpha) - \log\log(f/r_\alpha) \to -\infty . \tag{5.40}$$

The threshold for this rule is of the form

$$c_B^2 = 2\log\left(\frac{f}{r_\alpha}\right) - \log\left(2\log\left(\frac{f}{r_\alpha}\right)\right) + C_1 + o_t , \tag{5.41}$$

where $C_1 = \log\left(\frac{2}{\pi D^2}\right)$ and $D = 2(1 - \Phi(\sqrt{C}))$ is the asymptotic power. The corresponding probability of a type I error is equal to

$$t_1 = D\,\frac{r_\alpha}{f}\,(1 + o_t) .$$

The proof of Theorem 5.1 can be found in Appendix 8.3.
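
The quality of the expansion (5.41) at finite parameter values can be inspected numerically. The sketch below solves the defining equation of the BFDR threshold by bisection and compares $c_B^2$ with the expansion; as an assumption of this illustration, the asymptotic power $D$ is replaced by the finite-sample power $2(1-\Phi(c_B/\sqrt{u+1}))$, and the function names and parameter values are hypothetical.

```python
import math

def norm_sf(x):
    """Upper tail 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def r_over_f(alpha, p):
    """r_alpha / f with r_alpha = alpha / (1 - alpha) and f = (1 - p) / p."""
    return (alpha / (1.0 - alpha)) * (p / (1.0 - p))

def bfdr_threshold(alpha, p, u, hi=30.0, iters=200):
    """Solve (1 - Phi(c)) / (1 - Phi(c / sqrt(u + 1))) = r_alpha / f for c."""
    target = r_over_f(alpha, p)
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        ratio = norm_sf(mid) / norm_sf(mid / math.sqrt(u + 1.0))
        if ratio > target:
            lo = mid   # the ratio decreases in c, so move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def expansion(alpha, p, power):
    """Right-hand side of (5.41) without the o_t term, with D set to `power`."""
    big_l = -math.log(r_over_f(alpha, p))   # log(f / r_alpha)
    return 2.0 * big_l - math.log(2.0 * big_l) + math.log(2.0 / (math.pi * power**2))

if __name__ == "__main__":
    alpha, p, u = 0.05, 1e-4, 20.0   # illustrative values only
    c = bfdr_threshold(alpha, p, u)
    power = 2.0 * norm_sf(c / math.sqrt(u + 1.0))
    print("exact c_B^2  :", c * c)
    print("expansion    :", expansion(alpha, p, power))
    print("type I error :", 2.0 * norm_sf(c), "vs D r/f =", power * r_over_f(alpha, p))
```

With the finite-sample power in place of $D$, the relation $t_1 = D\,r_\alpha/f$ holds exactly, so the remaining gap between the two printed thresholds comes only from the normal tail approximation.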



Remark 5.2 In comparison to (5.39), condition (5.40) imposes an additional restriction on positive values of $s_t$ (i.e. large values of $\alpha$). It is clear from the proof of Theorem 5.1 that the necessity of this additional requirement results from the asymmetric roles of type I and type II errors in the Bayes risk, as discussed in Section 4.

Remark 5.3 Condition (5.39), given above, says (after some algebra) that a sequence of "optimal" BFDR levels $\alpha = \alpha_t$ satisfies $\alpha = \left(1 + (\delta\sqrt{u})^{1+h_t} f^{h_t}\right)^{-1}$ (i.e. $r_\alpha = \left((\delta\sqrt{u})^{1+h_t} f^{h_t}\right)^{-1}$) for some $h_t$, where $h_t \to 0$ as $t \to \infty$. Thus asymptotically, the optimal BFDR levels will generally be smaller as $\delta$ and $u$ get larger (or one increases while the other is fixed). Since $h_t$ is small, again variation in $f$ will typically have a minimal effect. Thus the general behavior of optimal BFDR levels is similar to what we observed for the BFDR of the Bayes oracle. Below we present several corollaries of Theorem 5.1, which provide additional, more explicit conditions for the optimality of BFDR control rules, each of which corroborates this broad finding.
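
The algebra behind Remark 5.3 can be spelled out as follows. This is a sketch of one possible route, assuming $s_t$ is defined through $\log(f\delta\sqrt{u}) = (1+s_t)\log(f/r_\alpha)$ as in (5.38); the explicit choice $h_t = -s_t/(1+s_t)$ is our reading of the remark, not a formula quoted from the paper.

```latex
\begin{align*}
\log(f\delta\sqrt{u}) &= (1+s_t)\,\log(f/r_\alpha)
  && \text{by (5.38)} \\
\log r_\alpha &= \log f - \frac{\log f + \log(\delta\sqrt{u})}{1+s_t}
  = -(1+h_t)\log(\delta\sqrt{u}) - h_t \log f ,
  && h_t := -\frac{s_t}{1+s_t} \\
r_\alpha &= \left( (\delta\sqrt{u})^{1+h_t} f^{h_t} \right)^{-1},
\qquad
\alpha = \frac{r_\alpha}{1+r_\alpha}
       = \left( 1 + (\delta\sqrt{u})^{1+h_t} f^{h_t} \right)^{-1}.
\end{align*}
```

Since $1 + h_t = 1/(1+s_t)$, the condition $s_t \to 0$ holds exactly when $h_t \to 0$.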

Corollary 5.1 A rule controlling the BFDR at the level $\alpha = \alpha_t$, such that $r_\alpha = \frac{s}{\delta\sqrt{u}}(1 + o_t)$, with $s \in (0, \infty)$, is asymptotically optimal.

The proof of Corollary 5.1 is immediate by verifying that (5.39) and (5.40) are satisfied by such a sequence of $\alpha$'s.

As a special case we discuss the situation described in Lemma 4.2, where each of the test statistics $X_i$ is based on $n$ replicates and the focus is on testing the simple null hypothesis $\mu_i = 0$. In this case any BFDR level $\alpha = \alpha_n \propto n^{-1/2}$ is asymptotically optimal:

Corollary 5.2 Assume that al = with Cg- G (0, oo). Moreover, assume that 
p — 0, n — 7- oo, — ^ — 7- s G [0, oo), ctq = 0, 5 = const and r = const. Then 
a rule controlling the BFDR at the level an = si G (0, oo), is asymptotically 
optimal. 

Proof. This is a direct consequence of Corollary 5.1. □

The following two corollaries shed some more light on asymptotic optimality of rules controlling the BFDR at a fixed level $\alpha \in (0, 1)$ or when $\delta$ remains fixed.

Corollary 5.3 A rule controlling the BFDR at a fixed level $\alpha \in (0, 1)$ is asymptotically optimal if and only if the ratio of loss functions $\delta$ converges to 0 at such a rate that

$$\frac{\log(\delta\sqrt{u})}{\log p} \to 0 \tag{5.42}$$

and

$$\frac{\delta^2 u}{\log p} \to 0 . \tag{5.43}$$

Proof. Note that the term $s_t$ defined in Theorem 5.1 is given by

$$s_t = \frac{\log(f\delta\sqrt{u})}{\log(f/r_\alpha)} - 1 = \frac{\log(\delta r_\alpha\sqrt{u})}{\log(f/r_\alpha)} .$$

We first show necessity. By Theorem 5.1 optimality holds only if (5.39) is fulfilled, i.e. $\frac{\log(\delta r_\alpha\sqrt{u})}{\log(f/r_\alpha)} \to 0$. Under Assumption (A), this can happen only if $p \to 0$, since $\alpha$ is a constant. When $p \to 0$, condition (5.39) reduces to (5.42). To complete the proof of necessity, observe that condition (5.40) from Theorem 5.1 implies that for fixed $\alpha$

$$2\log(\delta\sqrt{u}) - \log\log f \to -\infty ,$$

which yields (5.43) when $p \to 0$.

To prove sufficiency, first observe that Assumption (A) and (5.43) together imply that $p \to 0$, since $v \to \infty$ and (5.43) forces $\delta^2 u$ to be of smaller order than $|\log p|$. Hence for fixed $\alpha \in (0, 1)$, (5.42) and (5.43) imply that $s_t$ satisfies properties (5.39) and (5.40), respectively. □



Remark 5.4 Under Assumption (A), conditions (5.42) and (5.43) can occur together only if $\delta \to 0$. To this end, observe that Assumption (A) and (5.43) together imply that $p \to 0$. When $p \to 0$, (5.42) is equivalent to saying that $\frac{\log v}{2\log f} \to 1$. Under Assumption (A), this implies that $\frac{2\log f}{u}$ converges to the required $C$. This and (5.43) together imply that $\delta \to 0$.

Corollary 5.4 Suppose $\delta$ is fixed. Then a rule controlling the BFDR at level $\alpha = \alpha_t$ is asymptotically optimal if and only if $\alpha$ converges to 0 at such a rate that

$$\frac{\log(\alpha\sqrt{u})}{\log(f/\alpha)} \to 0 \tag{5.44}$$

and

$$\frac{\alpha^2 u}{\log(f/\alpha)} \to 0 . \tag{5.45}$$

Proof. The proof of this result is very similar to the proof of Corollary 5.3 and is therefore omitted. □



Corollary 5.3, given above, states that under Assumption (A), a rule controlling the BFDR at a fixed level $\alpha$ can be optimal only if $p \to 0$ and, due to Remark 5.4, the relative cost of type II errors increases. In particular, this implies that such a rule will not be asymptotically optimal in the problem of minimizing the overall misclassification rate (since in this case $\delta_0 = \delta_A$). This result provides important insight and brings new aspects of BFDR control procedures under sparsity into light in the context of multiple testing.



5.4. Optimal BFDR control on the verge of detectability 

In this section we present several results describing the behavior of BFDR control rules on the verge of detectability and in its neighborhood. Optimality on the verge of detectability is particularly important, since it guarantees asymptotically optimal performance in a very difficult scenario where signals are so small that they are barely detectable. Hence, procedures which are optimal on the verge of detectability are expected to be robust and give good overall performance when no prior information about the magnitudes of signals is available.

The first of our results below states that for signals on the verge of detectability, a rule controlling the BFDR at level $\alpha \in (0, 1)$ is asymptotically optimal if and only if the ratio between loss functions $\delta$ decreases to 0 at a relatively slow rate. The second lemma, dual to the first one, states that if $\delta$ = constant, then a BFDR control rule is asymptotically optimal for signals on the verge of detectability if and only if the BFDR level $\alpha$ decreases to zero at the same, very slow rate. This last result explains the good performance of BFDR control rules observed with respect to controlling the misclassification error for small $\alpha$'s, reported in [3] and [4].

Lemma 5.2 Suppose Assumption (A) holds with $C \in (0, \infty)$ and $p \to 0$. A rule controlling the BFDR at a fixed level $\alpha \in (0, 1)$ is asymptotically optimal if and only if $\delta \to 0$ at such a rate that $\frac{\log\delta}{\log p} \to 0$.

Proof. This is a direct consequence of Corollary 5.3. □



Lemma 5.3 Assume that $p \to 0$, $\delta$ = constant, and $\frac{\log v}{u} \to C$, with $0 < C < \infty$. A rule controlling the BFDR at level $\alpha = \alpha_t \in (0, 1)$ is asymptotically optimal if and only if $\alpha \to 0$ at such a rate that $\frac{\log\alpha}{\log p} \to 0$.

Proof. This is a direct consequence of Corollary 5.4. □

The conditions specified in Lemmas 5.2 and 5.3 make $\delta$ and $\alpha$ dependent on
the unknown sparsity parameter $p$. To make these results more applicable, we now
consider the situation in which the number of tests m goes to infinity and pm is such 
that ^ 

— ^"^ _!. for some constant K G (0, oo) . (5.46) 
log m 

Note that this large set includes all decreasing sequences $p_m$ such that $m^{-C_1} \le p_m \le m^{-C_2}$, where $C_1$ and $C_2$ are any constants satisfying $0 < C_2 < C_1 < \infty$.

Corollary 5.5 Consider the whole class of sparsity sequences $p_m$ satisfying (5.46). A rule controlling the BFDR at a fixed level $\alpha \in (0, 1)$ is asymptotically optimal for signals on the verge of detectability if and only if

$$\delta_m \to 0 \quad \text{and} \quad \frac{\log\delta_m}{\log m} \to 0 . \tag{5.47}$$

Remark 5.5 It is easy to check that under (5.46) and (5.47) signals on the verge of detectability are of the form

$$u_m = \beta \log m\,(1 + o_m) , \quad \text{with } \beta \in (0, \infty) . \tag{5.48}$$

Corollary 5.6 Consider the class of sparsity sequences $p_m$ satisfying (5.46). Assume that the ratio between losses $\delta$ is fixed. Then a BFDR control rule is asymptotically optimal for signals on the verge of detectability (5.48) if and only if $\alpha = \alpha_m$ satisfies

$$\alpha_m \to 0 \quad \text{and} \quad \frac{\log\alpha_m}{\log m} \to 0 . \tag{5.49}$$

Conditions (5.47) and (5.49) allow some freedom in the choice of $\delta$ and $\alpha$. Interestingly, some of these choices guarantee asymptotic optimality for signals which are substantially larger than those on the verge of detectability. Corollaries 5.7 and 5.8, given below, specify the range of magnitudes of signals for which such rules are asymptotically optimal.

Corollary 5.7 Suppose the number of tests $m \to \infty$. Consider the class of sparsity sequences $p_m$ satisfying (5.46). Suppose $\delta_m \to 0$, such that $\log\delta_m = o(\log m)$. A rule controlling the BFDR at a fixed level $\alpha \in (0, 1)$ is asymptotically optimal if and only if the sequence of the magnitudes of signals $u_m$ satisfies

$$\frac{u_m}{\log m} \to C_u \in (0, \infty) \quad \text{(verge of detectability)} \tag{5.50}$$

or

$$\frac{u_m}{\log m} \to \infty \quad \text{and} \quad u_m = o\left(\frac{\log m}{\delta_m^2}\right) . \tag{5.51}$$

Proof. Corollary 5.5 follows directly from Corollary 5.3. □



Corollary 5.8 Suppose the number of tests $m \to \infty$. Consider the class of sparsity sequences $p_m$ satisfying (5.46). Suppose that the ratio between loss functions $\delta$ is fixed. A rule controlling the BFDR at the level $\alpha_m \to 0$, such that $\log\alpha_m = o(\log m)$, is asymptotically optimal if and only if the sequence of magnitudes of signals $u_m$ satisfies

$$\frac{u_m}{\log m} \to C_u \in (0, \infty) \quad \text{(verge of detectability)} \tag{5.52}$$

or

$$\frac{u_m}{\log m} \to \infty \quad \text{and} \quad u_m = o\left(\frac{\log m}{\alpha_m^2}\right) . \tag{5.53}$$

Proof. Corollary 5.6 is a direct consequence of Corollary 5.4. □



5.5. Optimality of the asymptotic approximation to the BH 
threshold 

In [17] it is proved that when the number of tests tends to infinity and the fraction of true alternatives remains fixed, then the random threshold of the Benjamini-Hochberg procedure can be approximated by

$$c_{GW} : \quad \frac{1-\Phi(c_{GW})}{(1-p)(1-\Phi(c_{GW})) + p\left(1-\Phi\left(c_{GW}/\sqrt{u+1}\right)\right)} = \alpha . \tag{5.54}$$

Compared to the equation defining the BFDR control rule (5.35), the function on the left-hand side of (5.54) lacks $(1 - p)$ in the numerator. In the case where $p \to 0$ this term is negligible and one expects that the rule based on $c_{GW}$ asymptotically approximates the corresponding BFDR control rule for the same $\alpha$. The following theoretical result shows that this is indeed the case.

Theorem 5.2 Suppose $p \to 0$. Consider the rule rejecting the null hypothesis $H_{0i}$ if $X_i^2/\sigma^2 \ge c_{GW}^2$, where $c_{GW}$ is defined in (5.54). This rule is asymptotically optimal if and only if the corresponding BFDR control rule defined in (5.35) is asymptotically optimal. In this case we have

$$c_{GW}^2 = c_B^2 + o_t ,$$

where $c_B^2$ is the threshold of an asymptotically optimal BFDR control rule, defined in Theorem 5.1.
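
A small numerical experiment illustrates how close the two thresholds are when $p$ is small. The sketch below solves the defining equation of $c_{GW}$ and the BFDR equation for the same level $\alpha$ by bisection. The function names, the bracket and the parameter values are hypothetical; the only facts used are that both left-hand sides decrease in the threshold and that the BFDR expression equals $(1-p)$ times the GW expression.

```python
import math

def norm_sf(x):
    """Upper tail 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bfdr_lhs(c, p, u):
    """Left-hand side of the BFDR equation: (1 - p) appears in the numerator."""
    a, b = norm_sf(c), norm_sf(c / math.sqrt(u + 1.0))
    return (1.0 - p) * a / ((1.0 - p) * a + p * b)

def gw_lhs(c, p, u):
    """Left-hand side of the GW equation: same denominator, numerator without (1 - p)."""
    a, b = norm_sf(c), norm_sf(c / math.sqrt(u + 1.0))
    return a / ((1.0 - p) * a + p * b)

def solve_decreasing(g, level, lo=0.0, hi=30.0, iters=200):
    """Find c with g(c) = level for a decreasing function g by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    p, u, alpha = 1e-3, 20.0, 0.05   # illustrative values only
    c_b = solve_decreasing(lambda c: bfdr_lhs(c, p, u), alpha)
    c_gw = solve_decreasing(lambda c: gw_lhs(c, p, u), alpha)
    print("c_B^2 :", c_b**2)
    print("c_GW^2:", c_gw**2)
    print("BFDR at c_GW:", bfdr_lhs(c_gw, p, u), "vs alpha (1 - p) =", alpha * (1 - p))
```

The last line reflects the observation that $c_{GW}$ is the threshold of a rule controlling the BFDR at the slightly smaller level $\alpha(1-p)$, which is why $c_{GW}$ is marginally larger than $c_B$.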



Proof.

Note that (5.54) is equivalent to

$$\frac{1-\Phi(c_{GW})}{1-\Phi\left(c_{GW}/\sqrt{u+1}\right)} = \frac{r_{\alpha'}}{f} , \tag{5.55}$$

where $\alpha' = \alpha(1 - p)$. Thus $c_{GW}$ is the same as the threshold of a rule controlling the BFDR at the level $\alpha'$.

Define $s_t'$ by $\frac{\log(f\delta\sqrt{u})}{\log(f/r_{\alpha'})} = 1 + s_t'$. It follows easily that $s_t'$ satisfies (5.39) and (5.40) of Theorem 5.1 (with $\alpha$ replaced by $\alpha'$), if and only if $s_t$ defined in (5.38) satisfies (5.39) and (5.40). Thus the first part of the theorem is proved.

To complete the proof of the theorem, we observe that the optimality of a BFDR control rule implies that $r_\alpha/f \to 0$ and the optimality of the rule based on $c_{GW}$ implies that $r_{\alpha'}/f \to 0$. In either case, $p\,r_\alpha \to 0$ and thus (5.55) reduces to

$$\frac{1-\Phi(c_{GW})}{1-\Phi\left(c_{GW}/\sqrt{u+1}\right)} = p\,r_\alpha\,(1 + o_t) = \frac{r_\alpha}{f}\,(1 + o_t) . \tag{5.56}$$

Now, the asymptotic approximation to $c_{GW}$ can be obtained analogously to the asymptotic form of the threshold for an optimal BFDR control rule, provided in

5.6. Optimality of the Bonferroni correction

The Bonferroni correction is one of the oldest and most popular multiple testing rules. It is aimed at controlling the Family Wise Error Rate: FWER = $P(V > 0)$, where $V$ is the number of false discoveries. The Bonferroni correction at FWER level $\alpha$ rejects all null hypotheses for which $Z_i = |X_i|/\sigma$ exceeds the threshold

$$c_{Bon} : \quad 1 - \Phi(c_{Bon}) = \frac{\alpha}{2m} .$$

Under the assumption that $m \to \infty$, the threshold for the Bonferroni correction can be written as

$$c_{Bon}^2 = 2\log\left(\frac{m}{\alpha}\right) - \log\left(2\log\left(\frac{m}{\alpha}\right)\right) + \log(2/\pi) + o_m . \tag{5.57}$$
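
The accuracy of the expansion (5.57) is easy to check numerically. The following sketch computes the exact Bonferroni threshold from $1 - \Phi(c_{Bon}) = \alpha/(2m)$ with the standard library's normal quantile function and compares $c_{Bon}^2$ with the expansion without the $o_m$ term. The function names and the chosen values of $m$ and $\alpha$ are illustrative assumptions.

```python
import math
from statistics import NormalDist

def bonferroni_threshold_sq(m, alpha):
    """Exact c_Bon^2, where 1 - Phi(c_Bon) = alpha / (2 m)."""
    # By symmetry of the normal distribution, c_Bon = -Phi^{-1}(alpha / (2 m)).
    c = -NormalDist().inv_cdf(alpha / (2.0 * m))
    return c * c

def bonferroni_expansion(m, alpha):
    """Leading terms of the expansion: 2 L - log(2 L) + log(2 / pi), L = log(m / alpha)."""
    big_l = math.log(m / alpha)
    return 2.0 * big_l - math.log(2.0 * big_l) + math.log(2.0 / math.pi)

if __name__ == "__main__":
    alpha = 0.05   # illustrative FWER level
    for m in (10**4, 10**6, 10**8):
        exact = bonferroni_threshold_sq(m, alpha)
        approx = bonferroni_expansion(m, alpha)
        print(m, exact, approx, exact - approx)
```

Working with the lower-tail quantile avoids forming $1 - \alpha/(2m)$, which loses precision in floating point when $m$ is large.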

Comparison of this threshold with the asymptotic approximation to an optimal BFDR control rule (5.41) suggests that the Bonferroni correction will have similar asymptotic optimality properties in the "extremely" sparse case $p_m \propto 1/m$ (see also the comparison with the "universal threshold" discussed in Section 4.1). Indeed, these expectations are confirmed by the following lemma, which actually specifies a slightly larger set of sequences of sparsity parameters under which the Bonferroni correction is asymptotically optimal. Lemma 5.4 will be used in the next section for the proof of the optimality of the Benjamini-Hochberg procedure under very sparse signals.

Lemma 5.4 Assume that $m \to \infty$ and $p_m = z_m/m$, where $z_m$ converges to a finite positive constant or diverges to infinity at such a rate that

$$\frac{\log z_m}{\log m} \to 0 . \tag{5.58}$$

The Bonferroni procedure at FWER level $\alpha_m \to \alpha_\infty \in [0, 1)$ is asymptotically optimal if $\alpha_m$ satisfies the assumptions of Theorem 5.1.

Proof. Observe that under the assumptions of Lemma 5.4 and Theorem 5.1

$$c_{Bon}^2 = c_B^2 + 2\log z_m - 2\log(1 - \alpha_\infty) + 2\log D + o_m ,$$

where $D = 2(1 - \Phi(\sqrt{C}))$ and $c_B^2$ is the threshold of the rule controlling the BFDR at level $\alpha_m$. From (5.58) it follows easily that $c_{Bon}^2 = c_B^2(1 + o_m)$. By assumption, the rule based on the threshold $c_B^2$ is optimal, and hence $c_{Bon}^2$ satisfies condition (4.22) of Theorem 4.1. Condition (4.23) is satisfied, since by assumption $\log z_m$ is bounded below for sufficiently large $m$, and thus the optimality of the Bonferroni correction follows. □

6 Optimality of the Benjamini-Hochberg procedure

In this section we report results on the asymptotic optimality of the Benjamini-Hochberg procedure (BH). We consider a sequence of problems in which the number of tests $m \to \infty$ and the $\gamma$ sequence is indexed by $t = m$. In Section 6.1 we present BH and the formula for its random threshold $c_{BH}$. In Section 6.2, we prove the asymptotic optimality properties of BH under a wide range of sparsity parameters $p_m \to 0$, such that $m p_m \to s \in (0, \infty]$. The proof of the optimality of the type I error component of the risk is based on the precise results of [16] on the expected number of type I errors under the total null hypothesis and holds over the whole range of the sparsity parameters considered. The proof of the optimality of the type II error component is broken into two parts. In the extremely sparse case, described by Lemma 5.4, the optimality of BH follows from a comparison with the asymptotically optimal Bonferroni correction. For the remaining range of sparsity parameters, the proof follows from the approximation of the random threshold of BH by the asymptotically optimal threshold $c_{GW}$ (see (5.54)), given by [17]. Theorem 6.3, given below, extends the results of [17] to our sparse asymptotic scenario and illustrates the accuracy of this approximation.

Our results establish that, under the considered range of sparsity parameters, BH is asymptotically optimal if the chosen FDR control level $\alpha_m$ depends on $u_m$ and $\delta_m$ in the same way as specified in Theorem 5.1. Specifically, we show that BH at any fixed FDR level will be optimal for a wide class of sparsity levels and magnitudes of signals, as long as the loss ratio goes to zero slowly. A similar result is proved for the case of a fixed loss ratio, as long as the FDR control level goes to zero slowly. In Section 6.2.3 we give very transparent results describing the optimality properties of BH for signals on the verge of detectability. Sections 6.2.4 and 6.2.5 contain a comparison between the thresholds of the asymptotically optimal BH rule and the Bayes oracle and a summary of results on the expected numbers of true and false rejections by rules which are optimal on the verge of detectability.

6.1. Random threshold of the Benjamini-Hochberg procedure

Let $Z_i = X_i/\sigma$ and let $p_i = 2(1 - \Phi(|Z_i|))$ be the corresponding p-value. We sort p-values in ascending order $p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$ and denote

$$k = \max\left\{ i : p_{(i)} \le \frac{i\alpha}{m} \right\} . \tag{6.59}$$

BH at FDR level $\alpha$ rejects all the null hypotheses for which the corresponding p-values are smaller than or equal to $p_{(k)}$.

Let us denote $1 - F_m(y) = \#\{|Z_i| \ge y\}/m$. It is easy to check (e.g. see [H]) that the Benjamini-Hochberg procedure rejects the null hypothesis $H_{0i}$ when $Z_i^2 \ge \tilde{c}_{BH}^2$, where

$$\tilde{c}_{BH} = \inf\left\{ y : \frac{2(1 - \Phi(y))}{1 - F_m(y)} \le \alpha \right\} . \tag{6.60}$$

Note also that BH rejects the null hypothesis $H_{0i}$ whenever $Z_i^2$ exceeds the threshold of the Bonferroni correction. Therefore, we define the random threshold for BH as

$$c_{BH} = \min\{c_{Bon}, \tilde{c}_{BH}\} .$$

Comparing (6.60) and (5.54), we observe that the difference between $c_{BH}$ and $c_{GW}$ is in replacing the cumulative distribution function of $|Z_i|$ (appearing in (5.54)) by the empirical distribution function (in (6.60)). The proof of the optimality of BH, presented in the next section, is partially based on the investigation of the accuracy of this approximation.
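
For readers who prefer an operational view, the step-up rule described in this section can be sketched in a few lines of Python. This is a generic implementation of the standard Benjamini-Hochberg procedure, not code from the paper; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return the sorted indices of hypotheses rejected by BH at FDR level alpha."""
    m = len(pvalues)
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value lies below the line rank * alpha / m.
        if pvalues[idx] <= rank * alpha / m:
            k = rank
    # Step-up: reject every hypothesis with one of the k smallest p-values.
    return sorted(order[:k])

if __name__ == "__main__":
    pvals = [0.03, 0.035, 0.036, 0.9]
    print(benjamini_hochberg(pvals, alpha=0.05))
```

In this example the two smallest p-values lie above their own comparison levels, yet they are still rejected because the third one lies below its level; this step-up behavior is what makes the BH threshold random and data dependent.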

6.2. Optimality of BH 

To prove the optimality of the BH rule, we distinguish two cases. The first, the extremely sparse case, is characterized by

$$m p_m \to s \in (0, \infty] \quad \text{and} \quad \frac{\log(m p_m)}{\log m} \to 0 . \tag{6.61}$$

Specifically, condition (6.61) is satisfied by very sparse signals with $p_m \propto 1/m$. The second, "denser" case is characterized by

$$m p_m \to \infty \quad \text{and} \quad \frac{\log(m p_m)}{\log m} \to C_p \in (0, 1] . \tag{6.62}$$

The theorem on the optimality of BH presented in this section requires the following assumptions.

1. Number of tests $m$ and the sparsity $p_m$:

$m \to \infty$, $p_m \to 0$, $m p_m \to s \in (0, \infty]$ (6.63)

2. FDR level $\alpha_m$:

$\alpha_m \to \alpha_\infty < 1$, and (6.64)
$\alpha_m$ satisfies the conditions of Theorem 5.1, (6.65)
i.e. a rule controlling the BFDR at level $\alpha_m$ is asymptotically optimal

3. Additional assumption for the denser case: 

If fl6.62p holds, then assume 



Ur. 



< , for some /3 > . (6.66) 
Theorem 6.1 Under Assumptions (6.63)-(6.65), and the additional assumption (6.66) in the case where (6.62) holds, BH is asymptotically optimal.

The proof of Theorem 6.1 consists of two parts. The first part, concerned with the optimality of the type I error component of the risk, is based on the precise and powerful results of [16] on the expected number of false discoveries using BH under the total null hypothesis. This part of the proof does not require distinguishing between the extremely sparse and the denser case.



6.2.1 Bound on the type I error component of the risk

The first and most essential step of the proof of the optimality of the type I error component of the risk relies on showing that, under certain conditions, the expected number of false discoveries of BH, $EV$, is bounded by $c\,\alpha K$, where $\alpha$ is the FDR level, $K$ is the true number of signals and $c$ is a positive constant. This result is very intuitive in view of the definition of FDR (see (5.26)). The proof is, however, nontrivial, due to the difference between $E(V/R)$ and $EV/ER$.

Lemma 6.1 Consider the BH rule at a fixed FDR level $\alpha \le \alpha_0 < 1$. Let $K$ be the number of true signals. The conditional expected number of false rejections given that $K = k$, with $k < m(1/\alpha_0 - 1)$, is bounded by

$$E(V \mid K = k) \le \alpha\left(\frac{k+1}{1-\alpha} + \frac{\alpha}{(1-\alpha)^2}\right) . \qquad (6.67)$$

Specifically, for $1 \le k < m(1/\alpha_0 - 1)$

$$E(V \mid K = k) \le c_{\alpha_0}\, \alpha k \qquad (6.68)$$

with

$$c_{\alpha_0} = \frac{2-\alpha_0}{(1-\alpha_0)^2} .$$

The proof of Lemma 6.1 is given in Appendix 8.4.

Remark 6.1 Note that in the case where $\alpha_0 < 0.5$, the inequality $k < m(1/\alpha_0 - 1)$ is always fulfilled.
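The remark can be checked in one line, taking the constraint in Lemma 6.1 to be $k < m(1/\alpha_0 - 1)$ and using only the fact that the number of signals $k$ cannot exceed the number of tests $m$:

```latex
\alpha_0 < \tfrac{1}{2}
\;\Longrightarrow\; \frac{1}{\alpha_0} - 1 > 1
\;\Longrightarrow\; m\left(\frac{1}{\alpha_0} - 1\right) > m \ge k .
```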

The following lemma is an extension of Lemma 6.1 to the mixture model (2.6).

Lemma 6.2 Under assumptions (6.63)-(6.65), the expected number of false rejections is bounded by

$$E(V) \le C_1 \alpha_m m p_m ,$$

where $C_1$ is any constant satisfying



2-Qo 



Ci> 



y, , when s = oo 

, 2-QoC yjf^^^ g g (^Q^ 



s(l-Qoo) (l-Ooo)^ 
The proof of Lemma 6.2 is provided in Appendix 8.4.

Lemma 6.2 easily leads to the following Theorem 6.2 on the optimality of the type I error component of the risk of BH.

Theorem 6.2 Under assumptions (6.63)-(6.65), the type I error component of the risk of BH, $R_0$, satisfies $R_0/R_{opt} \to 0$, where $R_{opt}$ is the optimal risk (see (3.20)).

Proof. From Lemma 6.2,

$$\frac{R_0}{R_{opt}} = \frac{\delta_0 E(V)}{R_{opt}} \le C_1\, \frac{\delta_0\, \alpha_m m p_m}{R_{opt}} .$$

Theorem 6.2 now easily follows by invoking (5.39) and (5.40) (included in assumption (6.65)). □



6.2.2 Bound on the type II component of the risk

To prove the optimality of the type II component of the risk of BH, we consider the extremely sparse case (6.61) and the denser case separately. Note that in the extremely sparse case, the optimality of the type II component of the risk of BH follows directly from a comparison with the more conservative Bonferroni correction, which according to Lemma EH is asymptotically optimal in this range of sparsity parameters. The proof of optimality for the denser case is based on the approximation of the random threshold of BH by the asymptotically optimal threshold $c_{GW}$ (see (5.54)), given in Theorem 6.3 below. The corresponding "denser" case assumption (6.71) is substantially less restrictive than (6.62) and partially covers the extremely sparse case (6.61). Theorem 6.3 extends the results of [17] to the case where $p_m \to 0$ and illustrates the precision of this approximation.

Theorem 6.3 Assume that

$$p_m \to 0 , \qquad (6.70)$$

such that for sufficiently large $m$

$$p_m > \frac{\log^{\beta_p} m}{m} , \quad \text{for some constant } \beta_p > 1 . \qquad (6.71)$$

Moreover, assume that the sequence of FDR levels $\alpha_m$ satisfies the assumptions of Theorem 5.1. Then for every $\epsilon > 0$, every constant $\beta_u > 0$ and sufficiently large $m$ (dependent on $\epsilon$ and $\beta_u$)

$$P(|c_{BH} - c_{GW}| > \epsilon) \le m^{-\beta_u} ,$$

where $c_{GW}$ is the asymptotically optimal threshold defined in (5.54).

The proof of Theorem 6.3 is given in the Appendix.
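The closeness of the random BH threshold to its non-random counterpart can be explored numerically. The Python sketch below is only an illustration under stated assumptions: unit null variance, alternative variance $1 + u$, a GW threshold obtained by solving $2(1-\Phi(c))/(1-F(c)) = \alpha$ with $F$ the mixture distribution function of $|Z_i|$, and a Bonferroni threshold taken at level $\alpha/m$. All function names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def tail2(c, scale=1.0):
    """P(|N(0, scale^2)| >= c) = 2 * (1 - Phi(c / scale))."""
    return math.erfc(c / (scale * math.sqrt(2.0)))


def gw_threshold(alpha, p, u, hi=20.0):
    """Bisection for the c solving tail2(c) / (1 - F(c)) = alpha."""
    def ratio(c):
        mixture_tail = (1.0 - p) * tail2(c) + p * tail2(c, math.sqrt(1.0 + u))
        return tail2(c) / mixture_tail

    lo = 0.0  # ratio(0) = 1 > alpha; the ratio decreases in c
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ratio(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def bh_threshold(z_values, alpha):
    """Threshold c with 2 * (1 - Phi(c)) = alpha * k / m for the BH index k."""
    m = len(z_values)
    pvals = sorted(tail2(abs(z)) for z in z_values)
    k = 0
    for i, pv in enumerate(pvals, start=1):
        if pv <= i * alpha / m:
            k = i
    k = max(k, 1)  # with no rejections, fall back to the Bonferroni threshold
    return NormalDist().inv_cdf(1.0 - alpha * k / (2.0 * m))


def simulate(m, p, u, rng):
    """Draw m statistics from (1 - p) N(0, 1) + p N(0, 1 + u)."""
    sd_alt = math.sqrt(1.0 + u)
    return [rng.gauss(0.0, sd_alt if rng.random() < p else 1.0) for _ in range(m)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = simulate(50000, 0.02, 50.0, rng)
    print(bh_threshold(z, 0.1), gw_threshold(0.1, 0.02, 50.0))
```

In runs of this kind the two printed thresholds are expected to be close, in line with the statement of the theorem, although a single simulation at fixed $m$ says nothing about the rate $m^{-\beta_u}$.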



Using Theorem 6.3 we can easily show the asymptotic optimality of the type II component of the risk of BH.

Lemma 6.3 Suppose (6.70) and (6.71) hold and assumptions (6.64)-(6.66) are also true. The type II error component of the risk of BH satisfies

$$R_A \le R_{opt}(1 + o_m) . \qquad (6.72)$$

Proof. Denote the number of false negatives under the BH rule by $L_A$. Let us fix $\epsilon > 0$ and let $c_1 = c_{GW} + \epsilon$. Clearly,

$$E(L_A) \le E(L_A \mid c_{BH} \le c_1)P(c_{BH} \le c_1) + mP(c_{BH} > c_1) ,$$

and furthermore

$$E(L_A \mid c_{BH} \le c_1)P(c_{BH} \le c_1) \le E L_1 ,$$

where $L_1$ is the number of false negatives produced by the rule based on the threshold $c_1$. Note that the rule based on $c_1$ differs from the asymptotically optimal rule $c_{GW}$ only by a constant and therefore, from Theorem 4.1, it is asymptotically optimal. Hence, it follows that $\delta_A E L_1 = R_{opt}(1 + o_m)$. On the other hand, from Theorem 6.3, for any $\beta_u > 0$ and sufficiently large $m$ (dependent on $\epsilon$ and $\beta_u$)

$$P(c_{BH} > c_1) \le m^{-\beta_u} .$$

Therefore,

$$R_A = \delta_A E L_A \le R_{opt}(1 + o_m) + \delta_A m^{1-\beta_u} .$$

Now, using assumptions (6.71) and (6.66), and choosing e.g. $\beta_u = \beta + 1$, we conclude that $\delta_A m^{1-\beta_u} = o(R_{opt})$ and the proof is thus complete. □



Remark 6.2 According to Theorem 16.11 the BH rule is asymptotically optimal 
under the scenarios described in Corollaries 15. 1115^ if the assumptions (16. 63 p . (I6.64p 
and (I6.66P are satisfied. 

6.2.3 Optimality on the verge of detectability

Theorem 6.1 states that under the sparsity assumption (6.63), BH behaves similarly to a BFDR control rule. Specifically, this result shows that the optimal FDR level $\alpha$ should depend on the expected magnitude of a signal $u$ and the ratio between losses
5 according to the formula a ~ In this section we present some more specific 
results, which describe the behavior of BH in the important case of signals on the 
verge of detectability. 

Corollary 6.1 Suppose that Assumption (6.63) holds. Moreover, assume that

- C , (6.73) 

where $C \in (0, \infty)$. The BH rule controlling the FDR at a fixed level $\alpha \in (0, 1)$ is asymptotically optimal if $\delta_m$ converges to zero such that $\log \delta_m = o(\log p_m)$. This last condition also guarantees that signals satisfying (6.73) are on the verge of detectability. For a fixed $\delta$, the BH rule controlling the FDR at level $\alpha_m$ is asymptotically optimal if $\alpha_m$ converges to zero such that $\log \alpha_m = o(\log p_m)$.

Proof. The proof follows easily using Lemmas 5.2 (for the fixed $\alpha$ case) and 5.3 (for the fixed $\delta$ case) and checking that, under the given conditions, assumptions (6.64)-(6.66) are also satisfied. □

The following corollaries are analogous to Corollaries 5.5-5.6 for BFDR control rules and hold when

aim"^ <Vm< 02^""' , (6.74) 
for some constants $a_1 \in (0, \infty)$, $a_2 \in (0, \infty)$ and $a_3 \in (0, 1)$.

Corollary 6.2 Consider the class of sparsity sequences $p_m$ satisfying (6.74). The BH rule controlling the FDR at a fixed level $\alpha \in (0, 1)$ is asymptotically optimal for signals on the verge of detectability if and only if

$$\delta_m \to 0 \quad \text{and} \quad \frac{\log \delta_m}{\log m} \to 0 . \qquad (6.75)$$

Remark 6.3 It is easy to check that under (6.74) and (6.75), signals on the verge of detectability are of the form

$$u_m = \beta \log m\,(1 + o_m) , \quad \text{with } \beta \in (0, \infty) . \qquad (6.76)$$

Corollary 6.3 Consider the class of sparsity sequences $p_m$ satisfying (6.74). Assume that the ratio between losses $\delta$ is fixed. The BH rule is asymptotically optimal for signals on the verge of detectability $u_m = \beta \log m\,(1 + o_m)$ if and only if $\alpha = \alpha_m$ satisfies

$$\alpha_m \to 0 \quad \text{and} \quad \frac{\log \alpha_m}{\log m} \to 0 . \qquad (6.77)$$

6.2.4 A comparison of the threshold for the optimal BH rule and the Bayes Oracle

Here we briefly discuss the relationship between the BH threshold and the Oracle threshold for signals on the verge of detectability.

Suppose $p \to 0$ and $p > \log^{\beta} m / m$, where $\beta > 1$. Moreover, assume that the ratio between losses $\delta$ = Constant and that the signals are on the verge of detectability $u_m \propto -\log p$. The BH rule is asymptotically optimal, provided the FDR control level $\alpha$ converges to zero such that $\log\alpha / \log p \to 0$. Moreover, under these conditions the BH threshold can be sandwiched between two optimal GW thresholds at neighboring FDR levels. It follows from the proof of Theorem 6.3 that for such an optimal $\alpha$ sequence, for any $\beta_u > 0$, with probability greater than $(1 - m^{-\beta_u})$ (for all sufficiently large $m$), the BH threshold is given by

$$c_{BH}^2 = 2\log(1/p) - 2\log\alpha - \log(2\log(1/p)) + \text{Constant} + o_m . \qquad (6.78)$$

On the other hand, the Bayes Oracle threshold is given by

$$c_{BO}^2 = 2\log(1/p) + \log(2\log(1/p)) + \text{Constant} + o_m . \qquad (6.79)$$

So $c_{BO}^2 - c_{BH}^2 = 2\log(\log(1/p)) + 2\log\alpha + \text{Constant} + o_m$ and it is clear that whether the threshold of the Bayes Oracle is larger or smaller than that of BH depends on the rate of convergence of $\alpha$ to zero.
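Subtracting the BH expansion (6.78) from the Oracle expansion (6.79) term by term makes the role of $\alpha$ explicit. The worked step below is a sketch in which the additive $2\log 2$ is absorbed into the unspecified Constant:

```latex
\begin{aligned}
c_{BO}^2 - c_{BH}^2
 &= \bigl[2\log(1/p) + \log(2\log(1/p))\bigr]
  - \bigl[2\log(1/p) - 2\log\alpha - \log(2\log(1/p))\bigr]
  + \text{Constant} + o_m \\
 &= 2\log\bigl(2\log(1/p)\bigr) + 2\log\alpha + \text{Constant} + o_m \\
 &= 2\log\bigl(\alpha \log(1/p)\bigr) + \text{Constant} + o_m .
\end{aligned}
```

Up to bounded terms, the sign of the difference is therefore governed by the product $\alpha\log(1/p)$: the difference tends to $+\infty$ when $\alpha\log(1/p) \to \infty$ and to $-\infty$ when $\alpha\log(1/p) \to 0$.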



Now observe that under the above assumptions — — ). (0,11 and the above 
difference becomes 2 log(log m) + 2 log a + Constant + Om- In this case, for example, 
Cfio ~ '^'bh S06S to oo if a — > such that a > ^^^^^f^^, while it goes to — oo if, for 
example, a — j- in such a way that a < no m-iio Qo m) (^^ ^ong as log a = o(logm)). 



6.2.5 Expected numbers of rejections for rules which are optimal on the verge of detectability

Assume that $p_m$ satisfies assumption (6.63) and $\delta$ = Const. Now consider a multiple testing rule which is asymptotically optimal on the verge of detectability $u_m \propto -\log p_m$. Note that the threshold of such a rule is of the order of $-2\log p_m$ and is proportional to the magnitude of signals on the verge of detectability. Thus, according to Lemma 3.1, the expected number of true rejections is proportional to the expected number of true signals $m p_m = z_m$ (with proportionality coefficient $D = 2(1 - \Phi(\sqrt{C})) \in (0, 1)$). On the other hand, $-2\log p_m$ is of the order of the expected value of the largest statistic under the total null hypothesis. Thus, one might expect that the expected number of falsely rejected null hypotheses is also approximately of the order of $z_m$. However, it turns out that the second term in the asymptotic expansions for the asymptotically optimal rules (6.78) and (6.79) has a substantial influence on the probability of a type I error and, according to formula
(3.14), the expected number of false rejections is of a slightly smaller order, of order $z_m/\log(1/p)$. Thus, in the case where $z_m \to \infty$, the expected number of false rejections may converge to infinity, but the corresponding false discovery rate still converges to zero.

Similarly, it is easy to check that for a BH rule with a fixed FDR $\alpha \in (0, 1)$ and signals on the verge of detectability, the expected numbers of true and false discoveries are both proportional to $z_m$. Recall that such a rule is asymptotically optimal if $\delta_m \to 0$ and $\log(\delta_m) = o(\log p_m)$.



7 Discussion 

We have investigated the asymptotic optimality of multiple testing rules under spar- 
sity, using the framework of Bayesian decision theory. We formulated conditions for 
the asymptotic optimality of the universal threshold of [11] and the Bonferroni cor- 
rection. Moreover, as in [1], we have proved some asymptotic optimality properties 
of rules controlling the false discovery rate. Comparing with [1], we replaced a loss
function based on the error in estimation with a loss function dependent only on the 
type of testing error. This resulted in somewhat different optimality properties of 
BH. Specifically, we have proved that BH controlling the FDR at a fixed level $\alpha$ can
be asymptotically optimal only if the relative cost of type II errors increases when
$p \to 0$. However, this assumption does not undermine the desirable properties of
BH controlling at a fixed FDR level, since it is quite natural to impose a large loss 
for missed signals in the case where p (equivalently, the total number of signals) is 
very small. Our results also provide some hints on how the "optimal" FDR level 
should be chosen, depending on the expected magnitude of true signals and the ratio 
between the loss for type I and type II errors. 

In recent years many Bayesian and empirical Bayes methods for multiple test- 
ing have been proposed which provide a natural way of approximating the Bayes 
oracle in the case where the parameters of the mixture distribution are unknown. 
The advantages of these Bayesian methods, both in parametric and nonparametric 
settings, were illustrated in e.g. [31], [13], [3], [1]. A further discussion on the mul- 
tiplicity adjustment, inherent to the appropriately designed Bayesian methods of 
model selection, can be found in [22]. In [1] it is shown that both fully Bayesian and 
empirical Bayes methods substantially outperform the Benjamini-Hochberg proce- 
dure for moderately small values of p. However, analysis of the asymptotic properties 
of fully Bayesian methods in the case where $p_m \to 0$ remains a challenging task. In
the case of empirical Bayes methods, the asymptotic results given in ^ illustrate 
that consistent estimation of the mixture parameters is possible when $p_m \propto m^{-\beta}$,
with $\beta < 1$. New results on the convergence rates of these estimates, presented in
[20] , raise some hopes that proofs of the optimality properties of the corresponding 
empirical Bayes rules can be found. It is, however, not clear whether full or empir- 
ical Bayes methods can be asymptotically optimal in the extremely sparse case of 
$p_m \propto m^{-1}$. Note that in this situation the expected number of signals does not increase
when $m \to \infty$ and consistent estimation of the alternative distribution is not
possible. These doubts regarding the asymptotic optimality of Bayesian procedures 
in the extremely sparse case are partially confirmed by the simulation study in [1], 
where Bayesian methods are outperformed by BH and the Bonferroni correction for 
very small p. 

The Benjamini-Hochberg procedure can only be directly applied when the distribution under the null hypothesis is completely specified, i.e. when $\sigma$ is known. In the case of testing a simple null hypothesis (i.e. when $\sigma_0 = 0$), $\sigma$ can be estimated using replicates. The precision of this estimation depends on the number of replicates and can be arbitrarily good. In the case where $\sigma_0 > 0$ (i.e. when we want to distinguish large signals from background noise), the situation is quite different. In this case, $\sigma$ can only be estimated by pooling the information from all
the test statistics using, for example, empirical Bayes methods (see e.g. 0]). While 
the simulation results reported in [3] show that for very small p BH can outperform 
Bayesian approximations to the oracle even in this context, it is rather unlikely that 
such a plug-in version of BH is asymptotically optimal in the case where $p \propto m^{-1}$.
A thorough theoretical comparison of empirical Bayes versions of BH with Bayesian 
approximations to the Bayes oracle and an analysis of their asymptotic optimality 
remains an interesting topic for future research. 

In this paper we have modeled the test statistics using a scale mixture of normal
distributions. As already mentioned, we believe that the main conclusions of the paper
will hold for a substantially larger family of two component mixtures, which are
currently often applied to multiple testing problems (see e.g. [13]). Such two-group 
models assume a sharp distinction between the mechanisms generating the null and 
alternative hypotheses. In a recent article [7], a new "continuous" one-group model 
for multiple testing was proposed. As in our case, the test statistics are assumed to 
have a normal distribution with mean equal to zero, but the scale parameters are 
different for different tests and modeled as independent random variables from the 
one-sided Cauchy distribution. As discussed in [7], the resulting Bayesian estimate 
of the vector of means shrinks small effects strongly towards zero and leaves large 
effects almost intact. In this way, it enables very good separation of large signals 
from background noise. In [7] it is demonstrated that the results from the proposed
procedure for multiple testing often agree with the results from Bayesian methods 
based on the two-group model. A thorough analysis of the asymptotic properties of 
the method proposed in [7] in the context of multiple testing remains a challenging 
task. However, we believe that the suggested one-group model has its own, very in- 
teresting virtues and [7] clearly demonstrates that the search for modeling strategies 
for the problem of multiple testing, as well as for the most meaningful optimality 
criteria, is still an open and active area of research.

8 Appendix

8.1. Proof of Theorem 4.1

We first prove the sufficiency of (4.22) and (4.23) for the optimality of the multiple testing rule.

The condition zt = o(logf ) implies that t2 = A^^^^{1 + ot) and the constant A 
is equal to or ^'^(^)~^ according to whether C is zero or strictly positive. Note 
that (3.17) and the fact that $z_t = o(\log v)$ together imply that the probability of a
type I error is given by

t, = P{\Z\ > c) = fl^M_^^^ + (8.80) 

Now, assume that the constant $C$ specified in Assumption (A) is equal to 0. Excluding the multiplier $m$, the type I error component of the risk (see (3.19)) is equal to $R_1 = (1 - p)t_1\delta_0$, while the component corresponding to type II errors is $R_2 = p\,t_2\delta_A$. The ratio between $R_1$ and $R_2$ becomes

$$\frac{R_1}{R_2} = \frac{\delta f \sqrt{u}\, \exp(-z_t/2)}{\sqrt{v}\, \log v}\,(1 + o_t) .$$

By the definition of $v$ this is equal to

$$\frac{R_1}{R_2} = \exp(-z_t/2 - \log\log v)(1 + o_t) , \qquad (8.81)$$

which converges to zero if $z_t + 2\log\log v \to \infty$. This shows that the overall risk is given by $R = mR_2(1 + o_t)$, which is equivalent to the expression in (3.20).

In the case where $C > 0$, analogous steps give the required result. The only difference is that the ratio $R_1/R_2$ is a different multiple of the expression in (8.81). This completes the proof of the sufficiency part.

We will now prove that under Assumption (A), both conditions (4.22) and (4.23) are necessary for optimality to hold. First we prove the necessity of condition (4.22).

Assume that (4.22) does not hold. Noting that $z_t \ge -\log v$ (since $c_t^2 \ge 0$), clearly this can happen if either (i) $z_t/\log v$ converges to a point in $[-1, \infty] - \{0\}$ or (ii) $z_t/\log v$ does not converge anywhere.

First we consider case (i). This case leads to three distinct possibilities, dealt with separately in the following:

• $z_t/\log v \to -1$ and $c_t$ does not diverge to infinity. In this case, there exists a constant $C_1$, $0 \le C_1 < \infty$, and a subsequence of critical values $c_t$, such that $c_t \to C_1$. Observe that in this situation the probability of a type I error is given by

$$t_1 = P(|Z| > c_t) \to 2(1 - \Phi(C_1)) = C_2 > 0 .$$

Thus, the type I error component of the risk satisfies

R, = m5o(l - p)C2il + o,) = + o,) . (8.82) 



As a result, in the case where C = 0, the ratio of the type I error component 
of the risk to the optimal risk given in fl3.20p is given by 

Ropt y 2\ogv 
and the corresponding rule is clearly not optimal.
Similarly, for $C \in (0, \infty)$,



Ri C2 Cv 

■{l + oi) 00 



Ropt 2^y/C) - l\ \ogv 

Now consider the case: $z_t = s \log v\,(1 + o_t)$, $s \in [-1, 0)$ and $c_t \to \infty$. It is easy
to see that in this case

V y/v^/\ogV + Zt 

Note that 

^ ^ ^1 _ 4exp(-2;t/2 -log(logt;)) 

Ropt Ropt V 1 + k^i^ 

where the constant $A$ depends on the value of $C$. Observing that in this case $z_t + 2\log\log v \to -\infty$, it is clear that optimality does not hold.

Next, we consider the case where $s \in (0, \infty]$. There exists a constant $\epsilon > 0$, such that for sufficiently large $t$, $c_t^2 > \log v\,(1 + \epsilon)$. Thus, for all sufficiently large $t$, the probability of a type II error is given by



VuTiJ \ V u + 1 



Now observe that 



logt;(l + e)\ 21ogt;(l + e) ,^ , , ^ ^ 

P \\Z\< = \ ^(1 + Ot) when C = 

V M + 1 / V vr?i 



and 



'logf (1 + e) 



u 



P\\Z\< \ = (2<l>( + e) - 1)(1 + Ot) when C > . 



This implies that in both cases the asymptotic ratio of $R_2$ to $R_{opt}$ is larger than 1 and the rule with threshold $c_t$ is not optimal.



Now we consider case (ii). In this situation there will be at least two distinct points in $[-1, \infty]$ (and hence at least one point different from zero) to each of which some subsequence of $z_t/\log v$ converges. By an analogous argument to case (i), optimal risk properties will not hold for a subsequence which converges to a point in $[-1, \infty] - \{0\}$, and hence neither for the whole sequence.

To conclude the proof, we prove the necessity of condition (4.23). Suppose (4.23) does not hold. If (4.22) does not hold either, optimality cannot hold, since (4.22) is necessary for optimality, as proved above. If, on the other hand, (4.22) does hold, then the calculations leading to formula (8.81) remain valid. This implies that $R_1/R_2$ does not converge to zero (since (4.23) does not hold) and hence optimality does not hold.

8.2. The BFDR is a decreasing function of the threshold c

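Notation:

$\Phi_\sigma(x)$ and $\phi_\sigma(x)$ - cdf and pdf of $N(0, \sigma^2)$

$\Phi(x)$ and $\phi(x)$ - cdf and pdf of $N(0, 1)$

Proposition 8.1 If $\sigma > 1$, then the function

$$H(x) = \frac{1 - \Phi_\sigma(x)}{1 - \Phi(x)}$$

is increasing on $[0, \infty)$ and

$$\lim_{x \to \infty} H(x) = \infty . \qquad (8.83)$$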

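Proof.

First we prove (8.83). Observe that

$$1 - \Phi_\sigma(x) = \int_x^\infty \phi_\sigma(y)\,dy$$

and that

$$\frac{\phi_\sigma(x)}{\phi(x)} = \frac{1}{\sigma} \exp\left(\frac{x^2(\sigma^2 - 1)}{2\sigma^2}\right) .$$

Thus, for $\sigma > 1$, $\phi_\sigma(x) = \phi(x)g(x)$, where $g(x)$ is increasing on $[0, \infty)$ and $\lim_{x\to\infty} g(x) = \infty$.

Thus $\forall D > 0$, $\exists x_0 > 0$ such that $\forall x > x_0$

$$H(x) = \frac{\int_x^\infty \phi(y)g(y)\,dy}{\int_x^\infty \phi(y)\,dy} > D ,$$

which completes the proof of (8.83).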

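To show that $H(x)$ is increasing on $[0, \infty)$, consider an arbitrary pair of numbers $c_1$ and $c_2$, such that $0 \le c_1 < c_2$. Note that

$$H(c_1) = \frac{\int_{c_1}^\infty \phi(x)g(x)\,dx}{\int_{c_1}^\infty \phi(x)\,dx} = \frac{\int_{c_1}^{c_2} \phi(x)g(x)\,dx + \int_{c_2}^\infty \phi(x)g(x)\,dx}{\int_{c_1}^{c_2} \phi(x)\,dx + \int_{c_2}^\infty \phi(x)\,dx} .$$

Observe that

$$\frac{\int_{c_1}^{c_2} \phi(x)g(x)\,dx}{\int_{c_1}^{c_2} \phi(x)\,dx} < g(c_2) ,$$

while

$$H(c_2) = \frac{\int_{c_2}^\infty \phi(x)g(x)\,dx}{\int_{c_2}^\infty \phi(x)\,dx} > g(c_2) .$$

This implies that $H(c_1) < H(c_2)$, since for positive numbers $a, b, c, d$,

$$\frac{a}{b} < \frac{c}{d} \quad \text{implies that} \quad \frac{a+c}{b+d} < \frac{c}{d} .$$

□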
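The monotonicity claimed in Proposition 8.1 is easy to probe numerically. The following Python sketch evaluates the tail ratio $H(x) = (1-\Phi_\sigma(x))/(1-\Phi(x))$ on a grid, using the complementary error function from the standard library; the helper names `upper_tail` and `H` and the choice $\sigma = 2$ are illustrative assumptions, and a grid check is of course not a proof.

```python
import math


def upper_tail(x, sigma=1.0):
    """1 - Phi_sigma(x), the upper tail of a N(0, sigma^2) variable."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2.0)))


def H(x, sigma):
    """Tail ratio (1 - Phi_sigma(x)) / (1 - Phi(x))."""
    return upper_tail(x, sigma) / upper_tail(x, 1.0)


if __name__ == "__main__":
    sigma = 2.0
    grid = [0.1 * i for i in range(101)]  # x from 0 to 10
    values = [H(x, sigma) for x in grid]
    increasing = all(a < b for a, b in zip(values, values[1:]))
    print(increasing, values[0], values[-1])
```

On this grid the ratio starts at 1 for $x = 0$ and grows very quickly, which is the behavior the proposition describes.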






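Lemma 8.1 For any fixed $p \in (0, 1)$ and $u > 0$, the function

$$BFDR(c) = \frac{(1-p)(1-\Phi(c))}{(1-p)(1-\Phi(c)) + p\,(1-\Phi(c/\sqrt{u+1}))}$$

is monotonically decreasing from $(1 - p)$ for $c = 0$ to 0 for $c \to \infty$.

Proof.

Noting that

$$BFDR(c) = \left(1 + \frac{p}{1-p}\, H(c)\right)^{-1} \quad \text{with } \sigma = \sqrt{u+1} ,$$

the conclusion of Lemma 8.1 easily follows from Proposition 8.1. □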
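Because the BFDR decreases monotonically in the threshold, the threshold of a BFDR control rule at a given level can be found by simple bisection. The Python sketch below assumes the form $BFDR(c) = (1-p)(1-\Phi(c)) / [(1-p)(1-\Phi(c)) + p(1-\Phi(c/\sqrt{u+1}))]$ with unit null variance; the function names and the search bracket $[0, 20]$ are hypothetical choices made for illustration.

```python
import math


def upper_tail(x):
    """1 - Phi(x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bfdr(c, p, u):
    """Bayesian FDR of the rule rejecting when |Z| exceeds c (assumed form)."""
    null_part = (1.0 - p) * upper_tail(c)
    alt_part = p * upper_tail(c / math.sqrt(u + 1.0))
    return null_part / (null_part + alt_part)


def bfdr_threshold(alpha, p, u, hi=20.0):
    """Bisection for the c with BFDR(c) = alpha; requires 0 < alpha < 1 - p."""
    if not 0.0 < alpha < 1.0 - p:
        raise ValueError("alpha must lie in (0, 1 - p)")
    lo = 0.0  # BFDR(0) = 1 - p > alpha
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bfdr(mid, p, u) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    c = bfdr_threshold(0.05, p=0.01, u=9.0)
    print(c, bfdr(c, 0.01, 9.0))
```

The bisection relies only on the monotonicity stated in the lemma, so it needs no derivative of the BFDR; the upper bracket would have to be enlarged for parameter values where the BFDR at $c = 20$ is still above the target level.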
8.3. Proof of Theorem 5.1

The proof of Theorem 5.1 is based on the following two lemmas.

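Lemma 8.2 Under Assumption (A), a BFDR control rule (5.35) is asymptotically optimal only if

$$\frac{f}{r_\alpha} \to \infty \qquad (8.84)$$

and

$$\frac{2\log(f/r_\alpha)}{u} \to C . \qquad (8.85)$$

Its threshold value is given by (5.41), i.e.

$$c_B^2 = 2\log\left(\frac{f}{r_\alpha}\right) - \log\left(2\log\left(\frac{f}{r_\alpha}\right)\right) + C_1 + o_t , \qquad (8.86)$$

where $C_1 = \log\left(\frac{2}{\pi D^2}\right)$ and $D = 2(1 - \Phi(\sqrt{C}))$.

The corresponding probability of a type I error is given by

$$t_1 = D\,\frac{r_\alpha}{f}\,(1 + o_t) . \qquad (8.87)$$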

Proof. From Theorem 4.1 the multiple testing rule based on $c_B^2$ can be asymptotically optimal only if

$$t_1 = 2(1 - \Phi(c_B)) \to 0 \qquad (8.88)$$

and

$$\frac{c_B^2}{u+1} \to C . \qquad (8.89)$$

From (5.36), it follows that conditions (8.88) and (8.89) can both hold only if (8.84) is satisfied.

Moreover, (5.36) and (8.89) imply that

$$2(1 - \Phi(c_B)) = D\,\frac{r_\alpha}{f}\,(1 + o_t) , \qquad (8.90)$$

which proves (8.87).

Applying the normal tail approximation (3.17) to the left hand side of (8.90), we obtain

$$\sqrt{\frac{2}{\pi}}\,\frac{\exp(-c_B^2/2)}{c_B} = D\,\frac{r_\alpha}{f}\,(1 + o_t) .$$

Thus,

$$c_B^2 = 2\log(f/r_\alpha) + C_1 - 2\log c_B + o_t$$

and we subsequently obtain

$$c_B^2 = 2\log\left(\frac{f}{r_\alpha}\right) - \log\left(2\log\left(\frac{f}{r_\alpha}\right)\right) + C_1 + o_t .$$

This implies that condition (8.89) can hold only if

$$\frac{2\log(f/r_\alpha)}{u} \to C . \qquad (8.91)$$

□



Lemma 8.2 shows that conditions (8.84) and (8.85) are necessary for the optimality of a BFDR control rule. In Lemma 8.3 we prove that they are also sufficient for (8.86) to hold.

Lemma 8.3 Suppose Assumption (A) holds and $\alpha$ satisfies conditions (8.84) and (8.85). Then the threshold $c_B^2$ (defined in (5.35)) of a rule controlling the BFDR at level $\alpha$ is given by (8.86).

Proof.

First we will show that condition (8.85) implies that $c_B/\sqrt{u+1} = z_t$ is bounded from above as $t \to \infty$. If this is false, one can find a subsequence $t$ such that $z_t \to \infty$. Using the tail approximation (3.17), along this subsequence the following holds



1 - $(cb) exp(-2;?M/2) 



1 + 



This together with (I5.36P yields 



2 2l0g(^) , 
Zf = ^^-^ + Of. 

u 



This, together with the fact that $z_t \to \infty$, contradicts (8.85).

As $c_B/\sqrt{u+1}$ is bounded from above, it always remains within a compact interval, as it is non-negative. So given any subsequence, there will be a further subsequence that converges to a finite constant. Now consider an arbitrary subsequence $t$ such that $c_B/\sqrt{u+1} \to C_3 < \infty$ along this subsequence and let the corresponding asymptotic power be $D_1 = 2(1 - \Phi(C_3)) > 0$. Along this subsequence, (5.36) reduces to

$$1 - \Phi(c_B) = \frac{D_1 r_\alpha}{2f}\,(1 + o_t) .$$

Since we assume (8.84), using the normal tail approximation (3.17), for this subsequence the corresponding thresholds are of the form

$$c_B^2 = 2\log\left(\frac{f}{r_\alpha}\right) - \log\left(2\log\left(\frac{f}{r_\alpha}\right)\right) + C_1 + o_t , \qquad (8.92)$$

where $C_1 = \log\left(\frac{2}{\pi D_1^2}\right)$. Now observe that (8.92) and (8.85) imply that $C_3 = \sqrt{C}$. So we have proved that every convergent subsequence of $c_B/\sqrt{u+1}$ converges to $\sqrt{C}$. From the compactness of the sequence, this implies that the sequence itself will in fact converge to $\sqrt{C}$. The proof of the lemma is now complete, since this implies that the asymptotic form of the threshold $c_B^2$ for the sequence itself will in fact satisfy (8.92) with $D_1 = 2(1 - \Phi(\sqrt{C}))$ (i.e., it will satisfy (8.86)). □

Proof of Theorem 5.1

First we prove that for optimality to hold conditions (5.39) and (5.40) on $s_t$ (as defined in (5.38)) are sufficient. Note that (5.39) guarantees that (8.84) and (8.85) hold. According to Lemma 8.3 the threshold of the BFDR control rule can thus be written as

$$c_B^2 = \log v + z_t$$

with

$$z_t = -\log v + 2\log\left(\frac{f}{r_\alpha}\right) - \log\left(2\log\left(\frac{f}{r_\alpha}\right)\right) + C_1 + o_t .$$



According to Theorem 14. H such a rule is asymptotically optimal if 

log - 2 log log log ""^^^^ ^^'^^^ 

and 

log - 2 log (^y^ + log 1^2 log 1^^^ j - 2 log log v ^ -00 . (8.94) 

Since flCTl) holds, condition Km is fulfilled if and only if 

21og(//rJ _ 21og(//rJ ^ 
\ogv 21og(/50i) 

i.e. if and only if condition (5.39) holds. From (5.38), we obtain

$$\log v = 2(1 + s_t)\log(f/r_\alpha)$$

and simple computations show the equivalence of (8.94) and (5.40).

Let us now prove that for optimality to hold, it is also necessary that conditions (5.39) and (5.40) are satisfied by $s_t$. The optimality of a BFDR control rule implies, from Lemma 8.2, that both (8.84) and (8.85) hold. We may thus use the asymptotic approximation to the threshold given in (8.86). Since the rule is optimal, Theorem 4.1 implies that (8.93) and (8.94) must hold, which have been shown to be equivalent to (5.39) and (5.40).

8.4. Proofs of Theorems 6.1 and 6.3

First we will prove two lemmas needed for the proof of Theorem 6.3.

Proof of Lemma 6.1

Proof. Given the condition $K = k$, there are $(m - k)$ true nulls. Let the corresponding ordered p-values be $p_{(1)} \le \dots \le p_{(m-k)}$. Imagine that we apply to these p-values the following procedure $BH_1$, which rejects the hypotheses whose p-values are smaller than or equal to $p_{(\tilde k)}$, where

$$\tilde k = \operatorname{argmax}_i \left\{ p_{(i)} \le \frac{\alpha(i + k)}{m} \right\} . \qquad (8.95)$$

Let $V_1$ be the corresponding number of rejections. Then $E(V \mid K = k) \le E(V_1)$, since the number of false rejections for the original BH, $V$, is not larger than $V_1$. Now, consider $m$ i.i.d. p-values $q_1, \dots, q_m$ from the total null (i.e., each of the $m$ nulls is true), which are independent of the given original p-values. Let $q_{(1)} \le \dots \le q_{(m-k)}$ be the ordered values from the subsequence $q_1, \dots, q_{m-k}$. Then $q_{(1)}, \dots, q_{(m-k)}$ and $p_{(1)}, \dots, p_{(m-k)}$ have exactly the same distribution. Let $\tilde V_1$ and $V_2$ be the number of rejections of nulls when the procedure (8.95) is applied to the first $(m - k)$ or $m$ $q$'s respectively. Then $E(V \mid K = k) \le E(V_1) = E(\tilde V_1) \le E(V_2)$.

Now the bound on $k$ (see the assumption of Lemma 6.1) guarantees that the right hand side of (8.95) is smaller than 1 for all possible $i$. We can thus apply Lemma 4.2 of [16] directly, which yields

E{V2) = aj:ik + t + l)hT\\ (- 

Routine calculations now lead to Lemma 6.1

/ k 



E{V2) <aY.ik + i + l) 



a = a 



□ 



Proof of Lemma 6.2
Proof. Define C2 '■= 1 and $m_0 := \min(m, C_2 m)$. The following holds

$$E(V) \le \sum_{k=0}^{m_0} E(V \mid K = k)P(K = k) + mP(K > C_2 m) . \qquad (8.96)$$

The first term can be bounded for $m$ large enough using Lemma 6.1:



Y^E{V\K = k)P{K = k)<—- 

2-ao 



where Cy is any constant larger than ■ Now observe that ^ (1 — p„i)"^ 

converges to if s = 00 or to ^ \i otherwise. Hence, it follows that 

mo 

^ E{y\K = k)P{K = k)< CimamPm , 

k=0 

for any constant $C_1$ satisfying the assumption of Lemma 6.2.

Finally, note that the second term of (8.96) vanishes for $\alpha_\infty < 0.5$. On the other hand, for $\alpha_\infty \in [0.5, 1)$, Lemma 7.1 of [T] yields

mP{K > Cam) < mexp{-^mpmh{C2/pm)) , 

where $h(x) = \min(|x - 1|, |x - 1|^2)$. If $p_m \to 0$, then for any constant $C_3 \in (0, C_2)$ and sufficiently large $m$, the right hand side is bounded from above by $m\exp(-C_3 m) \to 0$. Now, from the assumptions $m p_m \to s > 0$ and $\alpha_m \to \alpha_\infty \ge 0.5$, it follows that for any constant $c > 0$ and sufficiently large $m$, the second term of (8.96) is smaller than $c\,\alpha_m m p_m$ and Lemma 6.2 follows. □

Proof of Theorem 6.3

The proof of Theorem 6.3 is based on a sequence of four lemmas. The first of these lemmas states that when $p_m \to 0$ such that $m p_m \to \infty$, the tail probability $(1 - F(c_{GW}))$ (of $|Z_i|$ being greater than the asymptotically optimal threshold $c_{GW}$) can be very well approximated by its estimate $(1 - \hat F_m(c_{GW}))$ based on the empirical distribution function.



Lemma 8.4 Suppose Assumption (A) is true and $p_m \to 0$. Consider the multiple testing rule based on the GW threshold $c_{GW}$, defined in (5.54), where the level $\alpha_m$ is chosen in such a way that this rule is asymptotically optimal (i.e. the condition in Theorem 5.1 is satisfied). For any constant $\xi \in (0, 1)$,



and

Proof. Let $1 - F(c_{GW}) = (1 - p_m)t_1(c_{GW}) + p_m(1 - t_2(c_{GW}))$, where $t_1(c_{GW})$ and $t_2(c_{GW})$ denote the probability of type I and type II errors for the procedure based on the GW threshold. Theorem 5.2 implies that (5.56) holds and hence $t_1(c_{GW})$ and $t_2(c_{GW})$ are exactly of the same asymptotic form as the corresponding probabilities of type I and type II errors for the rule controlling the BFDR at the same level. Therefore, using Theorem 5.1 we obtain

$$1 - F(c_{GW}) = p_m D\,(r_{\alpha_m} + 1)(1 + o_m) . \qquad (8.99)$$

Now observe that $Y = m(1 - \hat F_m(c_{GW}))$ is a Binomial $B(m, 1 - F(c_{GW}))$ random variable. Therefore, from Bennett's inequality (e.g. see [30], page 440), it follows that

P{Y > m(l - F{cGw)){l + 0) < exp |-^mp„D(r,„ + 1)^^(1 + o„)| 

and 

P{Y < m(l - F{cGw))il - 0) < exp |-imp^D(r,„ + 1)^^(1 + o™)| 
and the proof of Lemma 8.4 is complete. □



The next lemma shows that with a large probability the random threshold of BH is bounded from above by a sequence which converges to $c_{GW}$.

Lemma 8.5 Suppose Assumption (A) holds. Assume that $\alpha_m$ and $p_m$ satisfy the assumptions of Lemma 8.4. Let $c_{BH}$ be the BH threshold at level $\alpha_m$ and let $c_1 = c_{1m}$ be the GW threshold (5.54) at level $\alpha_{1m} = \alpha_m(1 - \xi_m)$, where $\xi_m \to 0$ and $\xi_m = o(1 - \alpha_m)$ as $m \to \infty$. It follows that

$$c_1 = c_{GW} + o_m \qquad (8.100)$$

and

P{cbh > ci) < exp |--mp™D(r„„ + 1)^^(1 + o„)| . 

Proof. Note that since $\xi_m = o(1 - \alpha_m)$ and $\xi_m \to 0$, it follows that $r_{\alpha_{1m}} = r_{\alpha_m}(1 + o_m)$. Thus, similarly to $c_{GW}$, $c_1$ satisfies equation (5.56) and (8.100) follows. Now observe that from the definition of $c_{BH}$,

Pics.<~c.) > Pf^^^<«.) 

V(l-F„(ci)) / 

" I l-Fmih)- [l~F{ci) - ' 

Now the proof follows from fl8.100p and invoking the arguments of Lemma 18.41 □ 

Based on Lemmas 8.4 and 8.5 we can now prove Lemma 8.6, which provides an 
upper bound on $c_{BH}$ in terms of $c_{GW}$. 
Lemma 8.6 Suppose that Assumption (A) holds and that $p_m$ and $\alpha_m$ satisfy the 
assumptions of Theorem 6.3. Let $c_{BH}$ and $c_{GW}$ denote the threshold of BH and 
its Genovese-Wasserman approximation (5.54) corresponding to FDR level $\alpha_m$. 
It follows that for every $\epsilon > 0$, every constant $\beta_u > 0$ and for sufficiently large $m$, 
we have 

$$P(c_{BH} > c_{GW} + \epsilon) \le m^{-\beta_u}. \tag{8.101}$$

Proof. We first observe that when $p_m \to 0$ and assumptions (6.64), (6.65) 
and (6.71) hold, one can invoke Lemma 8.5 by choosing $\xi_m = (\log m)^{s}$, where 
$s \in ((1 - \beta_p)/2, 0)$. By doing this, we conclude that for every constant $\beta_u > 0$ and 
sufficiently large $m$ 

$$P(c_{BH} > c_1) \le m^{-\beta_u}, \tag{8.102}$$

since by (6.62) $m p_m \ge \log^{\beta_p} m$. Thus (8.101) follows from (8.100). □ 



To finish the proof of Theorem 6.3, we show that with very large probability 
$c_{BH}$ can be bounded from below by a sequence which asymptotically converges to 
$c_{GW}$. The proof of this is substantially more complicated than the previous proof, 
since for $c_{BH} > c_2$ to hold, it is necessary that $\frac{2(1 - \Phi(y))}{1 - F_m(y)} > \alpha_m$ for all $y < c_2$. 

Lemma 8.7 Suppose that Assumption (A) holds and that $p_m$ and $\alpha_m$ satisfy the 
assumptions of Theorem 6.3. Let $c_{BH}$ and $c_{GW}$ denote the threshold of BH and 
its Genovese-Wasserman approximation (5.54) corresponding to FDR level $\alpha_m$. 
For every $\epsilon > 0$, every constant $\beta_u > 0$ and sufficiently large $m$, the following holds 

$$P(c_{BH} < c_{GW} - \epsilon) \le m^{-\beta_u}. \tag{8.103}$$

Proof. Let $c_2 = c_{2m}$ be the GW threshold (5.54) at the level $\alpha_{2m} = \alpha_m(1 + \xi_m)$, where 
$\xi_m = (\log m)^{s}$, with $s \in ((1 - \beta_p)/2, 0)$. Since $\xi_m \to 0$, it follows that $r_{\alpha_{2m}} = 
r_{\alpha_m}(1 + o_m)$ and $c_2$ satisfies equation (5.56) with $\alpha = \alpha_m$. Thus it easily follows that 
the rule based on $c_2$ is asymptotically optimal and satisfies 

$$c_2 = c_{GW} + o_m, \tag{8.104}$$

where $c_{GW}$ is given by (5.54) with $\alpha = \alpha_m$. 

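Now observe that from the monotonicity of $\frac{1 - \Phi(c)}{1 - F(c)}$, 

$$c_{BH} < c_2 \quad \text{if and only if} \quad \frac{2(1 - \Phi(c_{BH}))}{1 - F(c_{BH})} > \alpha_m(1 + \xi_m).$$

Moreover, from the definition of $c_{BH}$, 

$$\frac{2(1 - \Phi(c_{BH}))}{1 - F_m(c_{BH})} \le \alpha_m.$$

Thus, the event $\{c_{BH} < c_2\}$ implies that 

$$\frac{1 - F_m(c_{BH})}{1 - F(c_{BH})} > 1 + \xi_m,$$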
and in consequence 

$$P(c_{BH} < c_2) \le P\left(\sup_{c \in [0, c_2)} \frac{1 - F_m(c)}{1 - F(c)} > 1 + \xi_m\right). \tag{8.106}$$

Using the standard transformation $U_i = F(|X_i|)$, we obtain 

$$P(c_{BH} \in [0, c_2)) \le P\left(\sup_{t \in [0, z_{1m})} \frac{1 - G_m(t)}{1 - t} > 1 + \xi_m\right),$$

where $z_{1m} = F(c_2) = 1 - p_m D (r_{\alpha_m} + 1)(1 + o_m)$ and $G_m(t)$ is the empirical cdf for 
$m$ independent observations from the uniform distribution on $[0, 1]$. 

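Let $u_i = i/m$ and $k_{1m} = \lceil m(1 - C_1 p_m) \rceil$, where $C_1 \in (0, D)$. From the monotonicity 
of $G_m(t)$ and $t$, it follows that for sufficiently large $m$ 

$$P\left(\sup_{t \in [0, z_{1m})} \frac{1 - G_m(t)}{1 - t} > 1 + \xi_m\right) \le \sum_{i=0}^{k_{1m}} P\left(1 - G_m(u_i) > \left(1 - u_i - \frac{1}{m}\right)(1 + \xi_m)\right). \tag{8.107}$$

Note that for $i = 0$ and sufficiently large $m$, 

$$P\left(1 - G_m(u_i) > \left(1 - u_i - \frac{1}{m}\right)(1 + \xi_m)\right) = 0.$$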

We now consider the case $i \in \{1, \ldots, k_{1m}\}$. 
Observe that over this range of $i$ 

$$1 - u_i \ge C_1 \frac{\log^{\beta_p} m}{m} - \frac{1}{m}$$

and therefore 

$$\left(1 - u_i - \frac{1}{m}\right) = (1 - u_i)(1 - t_m),$$

where $t_m = O([\log m]^{-\beta_p}) = o(\xi_m)$. 

Now using Bennett's inequality, we obtain that for every $i \in \{1, \ldots, k_{1m}\}$ 

$$
\begin{aligned}
P\left(1 - G_m(u_i) > (1 - u_i)(1 + \xi_m)(1 - t_m)\right)
 &\le \exp\left\{-\frac{1}{4}\, m (1 - u_i)\,\xi_m^2 (1 + o_m)\right\} \\
 &\le \exp\left\{-\frac{C_2}{4} (\log m)^{\beta_p}\,\xi_m^2 (1 + o_m)\right\}
\end{aligned}
$$

for any constant $C_2 \in (0, C_1)$. Thus for any constant $\beta_u > 0$, sufficiently large $m$ 
and any $i \in \{1, \ldots, k_{1m}\}$, the following holds 

$$m\, P\left(1 - G_m(u_i) > (1 - u_i)(1 + \xi_m)(1 - t_m)\right) \le m^{-\beta_u},$$

and Lemma 8.7 follows by invoking (8.107) and (8.104). □ 
The proof of Theorem 6.3 results from combining Lemma 8.6 and Lemma 8.7. 
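
Taken together, Lemmas 8.6 and 8.7 say that the random BH threshold concentrates around the non-random GW threshold. The following minimal Python simulation sketch illustrates this for a single finite $m$; it is not part of the proof. It assumes $\sigma = 1$, the two-component normal scale mixture $(1-p)N(0,1) + pN(0, 1+\tau^2)$, and that the GW threshold solves $2(1-\Phi(c))/(1-F(c)) = \alpha$ with $F$ the cdf of $|X_i|$. The parameter values and function names are hypothetical choices made for illustration.

```python
import math
import random


def two_sided_tail(c):
    """2 * (1 - Phi(c)) for a standard normal, via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))


def gw_threshold(p, tau, alpha, lo=0.0, hi=20.0, iters=100):
    """Solve 2(1 - Phi(c)) / (1 - F(c)) = alpha by bisection (assumed form of the GW equation)."""
    scale = math.sqrt(1.0 + tau ** 2)  # standard deviation under the alternative

    def ratio(c):
        null_tail = two_sided_tail(c)          # P(|X| > c) under the null
        alt_tail = two_sided_tail(c / scale)   # P(|X| > c) under the alternative
        return null_tail / ((1.0 - p) * null_tail + p * alt_tail)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ratio(mid) > alpha:
            lo = mid   # ratio still too large: the threshold must be higher
        else:
            hi = mid
    return 0.5 * (lo + hi)


def bh_threshold(xs, alpha):
    """Smallest rejected |x| under the Benjamini-Hochberg step-up rule (inf if none)."""
    m = len(xs)
    abs_sorted = sorted((abs(x) for x in xs), reverse=True)  # smallest p-value first
    k = 0
    for i, a in enumerate(abs_sorted, start=1):
        if two_sided_tail(a) <= alpha * i / m:
            k = i  # keep the largest index satisfying the step-up condition
    return abs_sorted[k - 1] if k > 0 else math.inf


def simulate(m, p, tau, rng):
    """Draw m observations from (1 - p) N(0, 1) + p N(0, 1 + tau^2)."""
    sd_alt = math.sqrt(1.0 + tau ** 2)
    return [rng.gauss(0.0, sd_alt if rng.random() < p else 1.0) for _ in range(m)]


if __name__ == "__main__":
    rng = random.Random(0)
    m, p, tau, alpha = 20000, 0.05, 4.0, 0.1  # hypothetical illustration values
    xs = simulate(m, p, tau, rng)
    print("GW threshold:", gw_threshold(p, tau, alpha))
    print("BH threshold:", bh_threshold(xs, alpha))
```

With these settings the two thresholds should be close to each other; rerunning with larger `m` gives a rough picture of how the gap shrinks, although a single simulation at fixed parameters says nothing about the asymptotic regime $p_m \to 0$ treated in the lemmas.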